Abstract 

In this paper, we propose a visual tracker based on a metric-weighted linear representation of appearance. In order to 
capture the interdependence of different feature dimensions, we develop two online distance metric learning methods using 
proximity comparison information and structured output learning. The learned metric is then incorporated into a linear 
representation of appearance. We show that online distance metric learning significantly improves the robustness of the 
tracker, especially on those sequences exhibiting drastic appearance changes. In order to bound growth in the number of 
training samples, we design a time-weighted reservoir sampling method. 

Moreover, we enable our tracker to automatically perform object identification during the process of object tracking, by 
introducing a collection of static template samples belonging to several object classes of interest. Object identification results 
for an entire video sequence are achieved by systematically combining the tracking information and visual recognition at 
each frame. Experimental results on challenging video sequences demonstrate the effectiveness of the method for both 
inter-frame tracking and object identification. 

Index Terms 

Visual tracking, linear representation, structured metric learning, reservoir sampling 

I. Introduction 

Visual tracking is an important and challenging problem in computer vision, with widespread application domains. 
Its goal is to consistently locate an object of interest in multiple images captured at successive time steps. Despite 
great progress in recent years, visual tracking remains a challenging problem because of the complicated appearance 
changes caused by factors including illumination variation, shape deformation, occlusion, pose variation, background 
clutter, sophisticated object motion, and scene blurring. To address these factors, a variety of tracking approaches 
have been proposed to improve the robustness, speed, or accuracy of visual tracking. In order to effectively capture 
the dynamic spatio-temporal information on object appearance, these tracking approaches aim to learn generative or 
discriminative appearance models using a variety of statistical learning techniques, including hidden Markov models [1], 
mixture models [2], subspace learning [3]-[5], linear regression [5]-[8], covariance learning [9], compressive 
tracking [10], SVMs [11]-[14], boosting [15], [16], random forests [17], spatial attention learning [18], metric learning [19], 
[20], and tracking-learning-detection [21]. 

Linear representations, in which the object is represented as a linear combination of basis samples, are often used 
to build such appearance models. Suppose that we have a set of basis samples denoted as $T = [T_1 \ldots T_q] \in \mathbb{R}^{d \times q}$. 
Using these basis samples, a new sample $y$ can be approximated by the following linear combination: $y \approx Tc = c_1 T_1 + c_2 T_2 + \cdots + c_q T_q$, where $c = (c_1, c_2, \ldots, c_q)^\top$ is a coefficient vector. This gives rise to the following reconstruction 
error norm: $D_y = \|y - (c_1 T_1 + c_2 T_2 + \cdots + c_q T_q)\|_2$. In this case, the smaller $D_y$ is, the more likely $y$ is generated 
from the subspace spanned by T. During visual tracking, the subspace is likely to vary dynamically as new data arrive, 
so T should be accordingly adjusted to the new data. In addition, the feature representation for each sample is typically 
extracted from local image patches, whose appearance is often correlated because of their spatial proximity. Therefore, 
the elements of the feature representation in different dimensions often contain intrinsic spatial correlation information. 
In general, such correlation information is stable in the case of complicated appearance changes (e.g., partial occlusion, 
illumination variation, and shape deformation), and thus plays an important role in robust visual tracking. However, 
most existing linear representation-based trackers (e.g., [6], [7]) build linear regressors that treat feature dimensions 
independently and ignore the correlation between them. 

Therefore, we address the following three key issues for tracking robustness and efficiency: i) how to capture the 
intrinsic correlation between different feature dimensions; ii) how to maintain and update a limited-sized basis sample 
buffer T that effectively adapts during tracking; and iii) given this correlation information and a dynamically changing 
basis T, how to efficiently compute the optimal coefficient vector c at each frame. 

The core of our approach to these problems is online metric learning, which learns and updates a distance metric 
matrix over time. Within this framework, the three issues listed above are solved as follows. For i), inter-dimensional 
correlation is captured by the learned Mahalanobis distance metric matrix $M$ such that $M = L^\top L$, where $L$ projects the 
feature vector to a more discriminative feature space. In other words, online metric learning aims to find a linear mapping 
(i.e., a set of linear combinations over the correlated feature elements from different feature dimensions), which projects 
the original samples to a more discriminative feature space for robust visual tracking. We compare two different metric 
learning methods, one of which uses structured learning while the other is based only on pairwise sample proximity. 
For ii), we design a time-weighted reservoir sampling method to maintain and update limited-sized sample buffers in 
the metric learning procedure. In addition, we prove that metric learning based on our reservoir sampling method is 
statistically close to metric learning using all observed training samples. For iii), we pose the calculation of c as a 
least-squares optimization problem, which admits an extremely simple and efficient closed-form solution. We also demonstrate 
that, with the emergence of new data, the solution can be efficiently updated by a sequence of simple matrix operations. 

Therefore, the main contributions of this work are two-fold: 1) We propose a novel online metric-weighted linear 
representation for visual tracking. The linear representation is associated with a metric-weighted least-square optimization 
problem, which admits an extremely simple and efficient closed-form solution. The metric used in the linear representation 
is updated online in a max-margin optimization framework using proximity comparison or structured learning. Different 
from the similarity metric learning developed in [22], our work is formulated as a max-margin optimization problem for 
learning a Mahalanobis distance metric. Furthermore, we introduce the idea of structured learning to the online metric 
learning process, which is also novel in the visual tracking literature. 2) We design a time-weighted reservoir sampling 
method to maintain and update limited-sized sample buffers in the linear representation. The method is able to effectively 
maintain sample buffers that not only retain some old samples to avoid tracker drift, but also adapt to recent changes. 
In addition, we theoretically prove that metric learning based on our reservoir sampling method is statistically close to 
metric learning using all available training samples. This is the first time that reservoir sampling is used in an online 
metric learning setting that is tailored for robust visual tracking. 

We note that, if the template samples all represent the same object, this same procedure can be used to identify 
the object in the presence of multiple objects. The goal of object identification is naturally achieved by combining the 
tracking information and the linear regression-based visual recognition together. We obtain promising results of pedestrian 
identification on multi-view video sequences. 

Compared to previous systems, we fully exploit the linear representation of object appearance in a consistent and 
principled manner. By using online metric learning, our similarity measure is better maintained despite changing conditions. 
Reservoir sampling allows us to represent object appearance over time as a linear combination of samples. Our incremental 
solution update allows rapid update of object appearance coefficients, either for object tracking or identification. 

II. Related work 

Our work builds on recent progress in several related fields: i) linear representations; ii) distance metric learning; iii) 
reservoir sampling; and iv) structured tracking. We give a brief overview of the most relevant work in each of these 
areas. 

Linear representations. Mei and Ling [6] propose a tracker based on a sparse linear representation obtained by 
solving an $\ell_1$-regularized minimization problem. With the sparsity constraint, this tracker can adaptively select a small 
number of relevant templates to optimally approximate the given test samples. The main limitation is its computational 
expense due to solving an $\ell_1$-norm convex problem. To speed up the tracking, Bao et al. [23] take advantage of a 
fast numerical solver called accelerated proximal gradient, which solves the $\ell_1$-regularized minimization problem with 
guaranteed quadratic convergence. An alternative is to solve the $\ell_1$-regularized minimization problem in an approximate 
way. For example, Li et al. [7] propose to approximately solve the sparse optimization problem using orthogonal matching 
pursuit (OMP). Recently, research has revealed that the $\ell_1$-norm induced sparsity does not necessarily help improve the 
accuracy of image classification, and non-sparse representations are typically orders of magnitude faster to compute than 
their sparse counterparts, with competitive accuracy [24], [25]. Subsequently, Li et al. [26] propose a 3D discrete cosine 
transform based multilinear representation for visual tracking. The representation models the spatio-temporal properties of 
object appearance from the perspective of signal compression and reconstruction. However, the above trackers construct 
linear regressors that are defined on independent feature dimensions (mutually independent raw pixels in both [6] and 
[7]). In other words, the correlation information between different feature dimensions is not exploited. Such correlation 
information can play an important role in robust visual tracking. 

Distance metric learning. The goal of distance metric learning is to seek an effective and discriminative metric space, 
where both intra-class compactness and inter-class separability are maximized. In general, distance metric learning [27]-[30] 
is a popular and powerful tool for many applications. For example, in [27], a Mahalanobis distance metric is learned 
using positive semidefinite programming. Chechik et al. [22] propose a cosine similarity metric learning method using 
proximity comparison for large-scale image retrieval. Discriminative metric learning has also been successfully applied to 
visual tracking [19], [20], [31]. These works learn a distance metric mainly for object matching across adjacent frames, 
and the tracking is not carried out in the framework of linear representations. In addition, Hong et al. [32] learn a 
discriminative distance metric in a max-margin framework, where the average inter-class distance is maximized while 
the average intra-class distance is minimized. This distance metric learning approach is implemented in a batch-mode 
learning scheme, which does not allow for the online updating required for visual tracking. 

Reservoir sampling. Visual tracking is a time-varying process that deals with dynamic stream data in an online 
manner. Due to memory limitations, it is often impractical for trackers to store all the video stream data. To address this 
issue, reservoir sampling is a means of maintaining and updating limited-sized data buffers. However, the conventional 
reservoir sampling in [33], [34] can only accomplish the task of uniform random sampling, which assumes all samples 
are equally important. Due to temporal coherence, visual tracking in the current frame usually relies more on recently 
received samples than old samples. Hence, time-weighted reservoir sampling is required for robust visual tracking. 

Structured tracking. The objective of structured tracking is to utilize the intrinsic structural information on object 
appearance for robust object tracking. For instance, Jia et al. [35] construct a structured sparse appearance model, which 
performs alignment pooling on the sparse coefficient vectors for local patches within the object. Similar to [35], Zhong 
et al. [36] also propose a structured appearance model that carries out average pooling on the sparse coefficient vectors 
for local image patches within the object. Likewise, Li et al. [37] present a local block-division appearance model that 
comprises a set of block-specific SVM classification models. The Dempster-Shafer evidence theory is further used to 
fuse the block-specific SVM discriminative information for object localization. In addition, structured output learning 
(e.g., structured SVM [13]) is applied to visual tracking. Its key idea is to learn a classification model in a max-margin 
optimization framework, which involves an infinite number of constraints containing structured information (e.g., VOC 
overlap score [13]). 

Tracking and identification. Recent studies have demonstrated the effectiveness of combining object identification 
and tracking. For example, Edwards et al. [38] present an adaptive framework that improves the performance 
of face tracking and recognition by adaptively combining the motion information from the video sequence. Following 
this work, Zhou et al. [39] study the problem of simultaneously tracking and recognizing human faces in a particle filter 
framework. Moreover, Mei and Ling [6] formulate simultaneous vehicle tracking and identification as an $\ell_1$-regularized 
sparse representation problem. 

Compared with the previous work on appearance modeling, the advantages of this work are as follows. First, this 
work constructs a simple but effective appearance model based on a metric-weighted linear regression problem, which 
admits an extremely simple and efficient closed-form solution with online updating. Second, this work naturally embeds 
useful discriminative information within the process of appearance modeling using online metric learning, which finds 
an effective metric space (obtained by discriminative linear mappings) for metric-weighted linear regression. Third, this 
work effectively maintains limited-sized sample buffers for online metric learning by time-weighted reservoir sampling. 
The maintained buffers not only retain some old samples with a long lifespan to avoid tracker drift, but also 
adapt to recent appearance changes. A preliminary conference version of this work appears in [40]. 

III. The proposed visual tracking algorithm 

A. Particle filtering for tracking 

At the top level, visual tracking is posed as a sequential object state estimation problem, which is often solved in a 
particle filtering framework [41]. The particle filter can be divided into prediction and update steps: 

$$p(Z_t \mid O_{t-1}) \propto \int p(Z_t \mid Z_{t-1})\, p(Z_{t-1} \mid O_{t-1})\, dZ_{t-1}, \qquad p(Z_t \mid O_t) \propto p(o_t \mid Z_t)\, p(Z_t \mid O_{t-1}),$$

where $O_t = \{o_1, \ldots, o_t\}$ are observation variables, and $Z_t = (X_t, Y_t, S_t)$ denotes the motion parameters including $X$ 
translation, $Y$ translation, and scaling. The key distributions are $p(o_t \mid Z_t)$ denoting the observation model, and $p(Z_t \mid Z_{t-1})$ 
representing the state transition model. Usually, the motion between two consecutive frames is assumed to conform to 
a Gaussian distribution: $p(Z_t \mid Z_{t-1}) = \mathcal{N}(Z_t; Z_{t-1}, \Sigma)$, where $\Sigma$ denotes a diagonal covariance matrix with diagonal 
elements $\sigma_X^2$, $\sigma_Y^2$, and $\sigma_S^2$. For each state $Z_t$, there is a corresponding image region $o_t$ that is normalized by image 
scaling. The optimal object state $Z_t^*$ at time $t$ can be determined by solving the following maximum a posteriori (MAP) 
problem: $Z_t^* = \arg\max_{Z_t} p(Z_t \mid O_t)$. Therefore, efficiently constructing an effective observation model $p(o_t \mid Z_t)$ plays a 
critical role in robust visual tracking. Motivated by this observation, we design a metric-weighted linear representation 
that captures the intrinsic object appearance properties in a discriminative distance metric space. 
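
As a rough illustration of the prediction and MAP steps above, the following sketch draws candidate states from the Gaussian transition model and keeps the candidate with the highest observation likelihood. The function names (`propagate`, `map_state`), the particle count, and the toy `likelihood` callable are assumptions made for illustration; in the tracker the likelihood would be supplied by the observation model $p(o_t \mid Z_t)$. Ranking sampled particles by likelihood is a common simplification and is not claimed to be the exact procedure used here.

```python
import numpy as np

def propagate(z_prev, sigmas, n_particles, rng):
    """Draw candidate states Z_t ~ N(Z_{t-1}, diag(sigmas**2))."""
    z_prev = np.asarray(z_prev, dtype=float)
    noise = rng.normal(size=(n_particles, z_prev.size)) * np.asarray(sigmas, dtype=float)
    return z_prev + noise

def map_state(particles, likelihood):
    """Return the particle with the largest observation likelihood, plus all weights.

    The particles are drawn from the transition prior, so picking the one with
    the highest likelihood is used here as a stand-in for the MAP estimate.
    """
    weights = np.array([likelihood(z) for z in particles])
    return particles[int(np.argmax(weights))], weights
```

Here each row of `particles` holds one hypothetical state (x translation, y translation, scale).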

B. Problem formulation 

Modeling the observed appearance of an object $p(o_t \mid Z_t)$ is more complex than modeling its motion. This is often 
posed as a problem of linear representation and reconstruction, which corresponds to an $\ell_p$-norm regularized least-squares 
optimization problem (e.g., solved in [6], [7]). These optimization problems usually ignore the relative importance of 
individual feature dimensions as well as the correlation between feature dimensions. During tracking, such information 
plays a critical role in robust object/non-object classification with complicated appearance variations. Motivated by this 
observation, we propose a metric-weighted linear representation that is capable of capturing the varying correlation 
information between feature dimensions. As shown in Fig. 1, metric learning results in a linear representation that is 
more discriminative for object/non-object classification. 

Metric-weighted linear representation. More specifically, given a set of basis samples $P = (p_i)_{i=1}^{N} \in \mathbb{R}^{d \times N}$ and 
a test sample $y \in \mathbb{R}^{d \times 1}$, we aim to discover a linear combination of $P$ to optimally approximate the test sample $y$ by 
solving the following optimization problem: 

$$\min_{x} g(x; M, P, y) = \min_{x} (y - Px)^\top M (y - Px), \qquad (1)$$

where $x \in \mathbb{R}^{N \times 1}$ and $M$ is a symmetric distance metric matrix that can be decomposed as $M = L^\top L$. In principle, the 
idea of the metric-weighted linear representation is to linearly reconstruct the given test sample $y$ using the basis samples 
$(p_i)_{i=1}^{N}$ within a distance metric space (characterized by the Mahalanobis metric matrix $M$). The aforementioned linear 
regression problem is equivalent to the following form: $\min_{x} (Ly - LPx)^\top (Ly - LPx)$. In other words, we perform 
the linear reconstruction task on the transformed sample $Ly$ with respect to the transformed basis samples $(Lp_i)_{i=1}^{N}$. 
When $L$ is an identity matrix, our metric-weighted regression problem degenerates to a standard least-squares regression 
problem. 

The optimization problem (1) has an analytical solution that can be computed as: 

$$x^* = (P^\top M P)^{-1} P^\top M y. \qquad (2)$$

If $P^\top M P$ is a singular matrix, we use its pseudoinverse to compute $x^*$. 
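
The closed-form solution in Equ. (2) can be sketched in a few lines of NumPy. The function names below are illustrative rather than taken from the paper, and the use of `np.linalg.pinv` for both the regular and singular cases is an assumption: it equals the ordinary inverse whenever $P^\top M P$ is nonsingular.

```python
import numpy as np

def metric_weighted_solution(P, M, y):
    """Solve min_x (y - P x)^T M (y - P x) using the closed form of Equ. (2)."""
    J = P.T @ M @ P
    # pinv equals the inverse for nonsingular J and also covers the singular case.
    return np.linalg.pinv(J) @ (P.T @ M @ y)

def reconstruction_cost(P, M, y, x):
    """Evaluate g(x; M, P, y) = (y - P x)^T M (y - P x)."""
    r = y - P @ x
    return float(r @ M @ r)
```

With $M = L^\top L$, the result agrees with an ordinary least-squares fit of $Ly$ onto $LP$, which is the equivalence stated after Equ. (1).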

Tracking application. During tracking, we typically want to classify a candidate sample as either foreground or 
background. We are therefore interested in the relative similarity of the sample to a set of foreground and background 
Fig. 1: Illustration of our discriminative criterion based on metric-weighted linear representation. The first column displays the original frames; the 
second column shows the corresponding confidence maps without metric learning (i.e., M is an identity matrix); and the third column exhibits the 
corresponding confidence maps with metric learning. Clearly, our metric-weighted criterion is more discriminative. 


Algorithm 1 Metric-weighted linear representation 

Input: The current distance metric matrix $M$, the basis samples $P = (p_i)_{i=1}^{N} \in \mathbb{R}^{d \times N}$, any test sample $y \in \mathbb{R}^{d \times 1}$. 

Output: The optimal linear representation solution $x^*$ of sample $y$. 

1) Build the optimization problem in Equ. (1): 

$$\min_{x} g(x; M, P, y) = \min_{x} (y - Px)^\top M (y - Px)$$

2) Compute the optimal solution $x^* = (P^\top M P)^{-1} P^\top M y$. If samples are added to or removed from $P$, $(P^\top M P)^{-1}$ can be 
efficiently updated in an online manner: 

• Use Equ. (4) to compute the incremental inverse. 

• Use Equ. (5) to calculate the decremental inverse. 

• Obtain the replacement inverse based on the incremental and decremental inverses. 

3) Return the optimal solution x*. 


samples. To address this problem, we obtain the foreground and background linear regression solutions as follows: 
$x_f^* = \arg\min_{x_f} g(x_f; M, P_f, y)$ and $x_b^* = \arg\min_{x_b} g(x_b; M, P_b, y)$, where $P_f$ and $P_b$ are foreground and background 
basis samples, respectively. 

Thus, we can define a discriminative criterion for measuring the similarity of the test sample y to the foreground 
class: 

$$S(y) = \sigma\left[\exp(-\theta_f / \gamma_f) - \rho \exp(-\theta_b / \gamma_b)\right], \qquad (3)$$

where $\gamma_f$ and $\gamma_b$ are two scaling factors, $\theta_f = g(x_f^*; M, P_f, y)$, $\theta_b = g(x_b^*; M, P_b, y)$, $\rho$ is a trade-off control factor, and 
$\sigma[\cdot]$ is the sigmoid function. Here, the term $\exp(-\theta_f / \gamma_f)$ reflects the reconstruction similarity relative to the foreground 
class, while $\exp(-\theta_b / \gamma_b)$ determines the similarity with the background class. A greater $\exp(-\theta_f / \gamma_f)$ with a smaller 
$\exp(-\theta_b / \gamma_b)$ indicates a stronger confidence for foreground prediction. 

During tracking, the similarity score $S(\cdot)$ is associated with the observation model of the particle filter such that 
$p(o_t \mid Z_t) \propto S(o_t)$. 
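
A minimal sketch of the score in Equ. (3) follows. The default values of the scaling factors and the trade-off factor are placeholders chosen for illustration, not values reported in the paper, and `foreground_score` is a hypothetical name.

```python
import numpy as np

def foreground_score(theta_f, theta_b, gamma_f=1.0, gamma_b=1.0, rho=1.0):
    """Sigmoid of the gap between foreground and background reconstruction similarity.

    theta_f and theta_b are the metric-weighted reconstruction costs of the test
    sample with respect to the foreground and background bases, respectively.
    """
    z = np.exp(-theta_f / gamma_f) - rho * np.exp(-theta_b / gamma_b)
    return 1.0 / (1.0 + np.exp(-z))
```

A sample that is reconstructed well by the foreground basis (small `theta_f`) and poorly by the background basis (large `theta_b`) receives a score above 0.5.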

Implementing this framework involves three main challenges: i) maintaining a representative pool of foreground and 
background samples; ii) efficiently updating the solution when foreground or background samples are updated; iii) learning 
and updating the metric matrix M. These are addressed in the following four sections. 

C. Online solution update 

The main computational time of Equ. (2) is spent on the calculation of $(P^\top M P)^{-1}$. For computational efficiency, we 
need to incrementally or decrementally update the inverse when a sample is added to or removed from $P$, for a fixed 
metric $M$. Motivated by this observation, we design an online update scheme that deals with the following three cases: 
1) the basis sample matrix $P$ is incrementally expanded by one column such that $P_n = (P \;\; \Delta p)$; 2) the basis sample 
matrix $P$ is decrementally reduced by one column such that $P_o$ is the reduced matrix after removing the $i$-th column of 
$P$; and 3) one column of $P$ is replaced by a new sample. 

Incremental case. Let $P_n = (P \;\; \Delta p)$ denote the expanded matrix of $P$. Clearly, the following relation holds: 

$$(P_n)^\top M P_n = \begin{pmatrix} P^\top M P & P^\top M \Delta p \\ (\Delta p)^\top M P & (\Delta p)^\top M \Delta p \end{pmatrix}.$$

Algorithm 2 Online metric learning using proximity comparison 

Input: The current distance metric matrix $M^k$ and a new triplet $(p, p^+, p^-)$. 
Output: The updated distance metric matrix $M^{k+1}$. 

1) Calculate $a_+ = p - p^+$ and $a_- = p - p^-$. 

2) Compute the optimal step length $\eta = \min\left\{C, \max\left\{0, \dfrac{1 + a_+^\top M^k a_+ - a_-^\top M^k a_-}{2 a_-^\top U a_- - 2 a_+^\top U a_+ - \|U\|_F^2}\right\}\right\}$ with $U = a_- a_-^\top - a_+ a_+^\top$. 

3) $M^{k+1} \leftarrow M^k + \eta (a_- a_-^\top - a_+ a_+^\top)$. 


Algorithm 3 Online structured distance metric learning 

Input: The current distance metric matrix $M^k$, the current tracking bounding box $R_t$ and its associated feature vector $p_t$. 
Output: The updated distance metric matrix $M^{k+1}$. 

1) Initialize the most violated constraint set $V = \emptyset$. 

2) Sample a number of bounding boxes around $R_t$ to construct a constraint set for the optimization problem (14). 

3) Compute the most violated constraint $(\mu, \nu)$ as shown in Equ. (15). 

4) Add $(p_t^{\mu} \circ R_t^{\mu}, p_t^{\nu} \circ R_t^{\nu})$ to the most violated constraint set such that $V \leftarrow V \cup \{(p_t^{\mu} \circ R_t^{\mu}, p_t^{\nu} \circ R_t^{\nu})\}$. 

5) Solve the optimization problem (18) to obtain the optimal step length vector $\eta^*$. 

6) Compute the updated metric matrix $M$ according to Sec. III-D. 

7) Repeat Steps 3-6 until convergence (restricted by a maximum iteration number). 

8) Return $M^{k+1} \leftarrow M$. 


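For simplicity, let $H = (P^\top M P)^{-1}$, $c = P^\top M \Delta p$, and $r = (\Delta p)^\top M \Delta p$. Since $M$ is a symmetric matrix, $c^\top = (\Delta p)^\top M P$. 
The inverse of $(P_n)^\top M P_n$ can be computed as [42]: 

$$\left((P_n)^\top M P_n\right)^{-1} = \begin{pmatrix} H + \dfrac{H c c^\top H}{r - c^\top H c} & -\dfrac{H c}{r - c^\top H c} \\ -\dfrac{c^\top H}{r - c^\top H c} & \dfrac{1}{r - c^\top H c} \end{pmatrix}. \qquad (4)$$
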


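Decremental case. Let $P_o$ denote the reduced matrix of $P$ after removing the $i$-th column such that $1 \le i \le N$. 
Based on [42], the inverse of $(P_o)^\top M P_o$ can be computed as: 

$$\left((P_o)^\top M P_o\right)^{-1} = H(\mathcal{I}_i, \mathcal{I}_i) - \frac{H(\mathcal{I}_i, i)\, H(i, \mathcal{I}_i)}{H(i, i)}, \qquad (5)$$

where $\mathcal{I}_i = \{1, 2, \ldots, N\} \setminus \{i\}$ stands for the index set except $i$. 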

Replacement case. For adapting to object appearance changes, it is necessary for trackers to replace an old sample 
from the buffer with a new sample. Sample replacement is implemented in two stages: 1) old sample removal; and 2) 
new sample addition, corresponding to the decremental and incremental cases described above. 

The complete optimization procedure including online sample update is summarized in Algorithm 1. 
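
The three update cases can be sketched as follows. This is an illustrative NumPy version of the block-inverse updates for a fixed symmetric $M$; the function names are hypothetical, and the replacement routine appends the new sample as the last column, which is an assumption about buffer ordering that the text does not specify.

```python
import numpy as np

def add_column_inverse(H, P, M, dp):
    """Inverse of (P_n)^T M P_n for P_n = (P dp), given H = (P^T M P)^{-1}."""
    c = P.T @ M @ dp              # cross term P^T M dp
    r = dp @ M @ dp               # scalar (dp)^T M dp
    Hc = H @ c                    # H is symmetric, so c^T H = (H c)^T
    s = r - c @ Hc                # Schur complement r - c^T H c
    n = H.shape[0]
    out = np.empty((n + 1, n + 1))
    out[:n, :n] = H + np.outer(Hc, Hc) / s
    out[:n, n] = -Hc / s
    out[n, :n] = -Hc / s
    out[n, n] = 1.0 / s
    return out

def remove_column_inverse(H, i):
    """Inverse of (P_o)^T M P_o after deleting column i (0-based) of P."""
    keep = np.delete(np.arange(H.shape[0]), i)
    return H[np.ix_(keep, keep)] - np.outer(H[keep, i], H[i, keep]) / H[i, i]

def replace_column_inverse(H, P, M, i, dp):
    """Remove column i, then append dp; returns the new inverse and the new basis."""
    P_o = np.delete(P, i, axis=1)
    H_o = remove_column_inverse(H, i)
    return add_column_inverse(H_o, P_o, M, dp), np.column_stack([P_o, dp])
```

Each call costs on the order of $N^2$ operations plus the products with $M$, instead of re-inverting the $N \times N$ matrix from scratch.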


D. Online proximity based metric learning 

Having introduced the metric-weighted linear representation in Sec. III-B, we now address the key issue of calculating 
the metric matrix M. M should ideally be learned from the visual data, and should be dynamically updated as conditions 
change throughout a video sequence. 

1) Triplet-based ranking loss: Suppose that we have a set of sample triplets $\{(p, p^+, p^-)\}$ with $p, p^+, p^- \in \mathbb{R}^d$. 
These triplets encode the proximity comparison information. In each triplet, the distance between $p$ and $p^+$ should be 
smaller than the distance between $p$ and $p^-$. 

The Mahalanobis distance under metric $M$ is defined as: 

$$D_M(p, q) = (p - q)^\top M (p - q). \qquad (6)$$

Clearly, $M$ must be a symmetric and positive semidefinite matrix. It is equivalent to learn a projection matrix $L$ such 
that $M = L^\top L$. In practice, we generate the triplet set such that $p$ and $p^+$ belong to the same class while $p$ and $p^-$ belong to 
different classes. We therefore want the constraints $D_M(p, p^+) < D_M(p, p^-)$ to be satisfied as well as possible. By putting 
this into a large-margin learning framework, and using the soft-margin hinge loss, the loss function for each triplet is: 

$$l_M(p, p^+, p^-) = \max\{0, 1 + D_M(p, p^+) - D_M(p, p^-)\}. \qquad (7)$$
2) Large-margin metric learning: To obtain the optimal distance metric matrix $M$, we need to minimize the global 
loss $L_M$ that takes the sum of hinge losses (7) over all possible triplets from the training set: 

$$L_M = \sum_{(p, p^+, p^-) \in Q} l_M(p, p^+, p^-), \qquad (8)$$

where $Q$ is the triplet set. To sequentially optimize the above objective function $L_M$ in an online fashion, we design an 
iterative algorithm to solve the following convex problem: 

$$M^{k+1} = \arg\min_{M} \frac{1}{2}\|M - M^k\|_F^2 + C\xi, \quad \text{s.t. } D_M(p, p^-) - D_M(p, p^+) \ge 1 - \xi, \;\; \xi \ge 0, \qquad (9)$$

where $\|\cdot\|_F$ denotes the Frobenius norm, $\xi$ is a slack variable, and $C$ is a positive factor controlling the trade-off between 
the smoothness term $\frac{1}{2}\|M - M^k\|_F^2$ and the loss term $\xi$. Following the passive-aggressive mechanism used in [22], [43], 
we only update the metric matrix $M$ when $l_M(p, p^+, p^-) > 0$. 

3) Optimization of M: We optimize the function in Equ. (9) with Lagrangian regularization: 

$$\mathcal{L}(M, \eta, \xi, \beta) = \frac{1}{2}\|M - M^k\|_F^2 + C\xi - \beta\xi + \eta\left(1 - \xi + D_M(p, p^+) - D_M(p, p^-)\right), \qquad (10)$$

where $\eta \ge 0$ and $\beta \ge 0$ are Lagrange multipliers. The optimization procedure is carried out in the following two 
alternating steps. 

• Update $M$. By setting $\partial \mathcal{L} / \partial M = 0$, we arrive at the update rule 

$$M^{k+1} = M^k + \eta U, \qquad (11)$$

where $U = a_- a_-^\top - a_+ a_+^\top$, $a_+ = p - p^+$, and $a_- = p - p^-$. 

• Update $\eta$. Subsequently, we take the derivative of the Lagrangian (10) w.r.t. $\eta$ and set it to zero, leading to the 
update rule: 

$$\eta = \min\left\{C, \max\left\{0, \frac{1 + a_+^\top M^k a_+ - a_-^\top M^k a_-}{2 a_-^\top U a_- - 2 a_+^\top U a_+ - \|U\|_F^2}\right\}\right\}. \qquad (12)$$



The full derivation of each step can be found in the supplementary file (as shown in Sec. VII). The complete procedure 
of online distance metric learning is summarized in Algorithm 2. 
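
One passive-aggressive step of Algorithm 2 can be sketched as below. The helper names and the default value of `C` are illustrative assumptions. The guard against a non-positive denominator handles the degenerate case $U = 0$ and is an addition for numerical safety, not something stated in the text. No projection onto the positive semidefinite cone is applied, because this section does not describe one.

```python
import numpy as np

def triplet_loss(M, p, p_pos, p_neg):
    """Hinge loss of Equ. (7): max(0, 1 + D_M(p, p+) - D_M(p, p-))."""
    a_pos, a_neg = p - p_pos, p - p_neg
    return max(0.0, 1.0 + a_pos @ M @ a_pos - a_neg @ M @ a_neg)

def update_metric(M, p, p_pos, p_neg, C=1.0):
    """One online update M^{k+1} = M^k + eta * U following Equ. (11) and (12)."""
    a_pos, a_neg = p - p_pos, p - p_neg
    loss = 1.0 + a_pos @ M @ a_pos - a_neg @ M @ a_neg
    if loss <= 0.0:
        return M                               # passive: margin already satisfied
    U = np.outer(a_neg, a_neg) - np.outer(a_pos, a_pos)
    denom = 2.0 * (a_neg @ U @ a_neg) - 2.0 * (a_pos @ U @ a_pos) - np.sum(U * U)
    if denom <= 0.0:
        return M                               # degenerate triplet (U = 0), added guard
    eta = min(C, max(0.0, loss / denom))       # step length clipped to [0, C]
    return M + eta * U
```

Since $a_-^\top U a_- - a_+^\top U a_+ = \|U\|_F^2$, the denominator equals $\|U\|_F^2$, so when $C$ is large enough the step drives the hinge loss of the current triplet to zero.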

4) Online matrix inverse update: When updated according to Algorithm 2, $M$ is modified by rank-one additions 
such that $M \leftarrow M + \eta (a_- a_-^\top - a_+ a_+^\top)$, where $a_+ = p - p^+$ and $a_- = p - p^-$ are two vectors (defined in Equ. (11)) 
for triplet construction, and $\eta$ is a step-size factor (defined in Equ. (12)). As a result, the original $P^\top M P$ becomes 
$P^\top M P + (\eta P^\top a_-)(P^\top a_-)^\top + (-\eta P^\top a_+)(P^\top a_+)^\top$. When $M$ is modified by a rank-one addition, the inverse of 
$P^\top M P$ can be updated according to the theory of [44], [45]: 

$$(J + u v^\top)^{-1} = J^{-1} - \frac{J^{-1} u v^\top J^{-1}}{1 + v^\top J^{-1} u}. \qquad (13)$$

Here, $J = P^\top M P$, $u = \eta P^\top a_-$ (or $u = -\eta P^\top a_+$), and $v = P^\top a_-$ (or $v = P^\top a_+$). 
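
The rank-one update in Equ. (13) can be applied twice per metric step, once for each outer product. The sketch below is illustrative and uses hypothetical function names; it assumes the denominator $1 + v^\top J^{-1} u$ is nonzero, which holds whenever the updated matrix remains invertible.

```python
import numpy as np

def sherman_morrison(J_inv, u, v):
    """Return (J + u v^T)^{-1} given J^{-1}, following Equ. (13)."""
    Ju = J_inv @ u
    vJ = v @ J_inv
    return J_inv - np.outer(Ju, vJ) / (1.0 + v @ Ju)

def inverse_after_metric_step(J_inv, P, a_pos, a_neg, eta):
    """Update (P^T M P)^{-1} after M <- M + eta * (a_- a_-^T - a_+ a_+^T)."""
    v_neg = P.T @ a_neg
    v_pos = P.T @ a_pos
    J_inv = sherman_morrison(J_inv, eta * v_neg, v_neg)    # add  eta * (P^T a_-)(P^T a_-)^T
    J_inv = sherman_morrison(J_inv, -eta * v_pos, v_pos)   # add -eta * (P^T a_+)(P^T a_+)^T
    return J_inv
```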


E. Online structured metric learning 

Metric learning based on sample proximity comparisons leads to an efficient online learning algorithm, but requires 
pre-defined sets of positive and negative samples. In tracking, these usually correspond to target/non-target image patches. 
The boundary between these classes typically occurs where sample overlap with the target drops below a threshold, but 
this can be difficult to evaluate exactly and thus introduces some noise into the algorithm. 

In this Section, we replace the proximity based metric learning module with an online structured metric learning 
method for learning M. The main advantage of this method is that it directly learns the metric from measured sample 
overlap, and therefore does not require the separation of samples into positive and negative classes. 

Structured ranking. Let $p_t$ and $p_t^j$ denote two feature vectors extracted from two image patches, which are respectively 
associated with two bounding boxes $R_t$ and $R_t^j$ from frame $t$. Without loss of generality, let us assume that $R_t$ corresponds 
to the bounding box obtained by the current tracker while $R_t^j$ is associated with a bounding box from the area surrounding 
$R_t$. As in [13], the structural affinity relationship between $p_t$ and $p_t^j$ is captured by the following overlap function: 
$s^o(R_t, R_t^j) = \frac{|R_t \cap R_t^j|}{|R_t \cup R_t^j|}$. As a result, we define the following optimization problem for structured metric learning: 

$$M^{k+1} = \arg\min_{M} \frac{1}{2}\|M - M^k\|_F^2 + C\xi, \quad \text{s.t. } D_M(p_t, p_t^j) - D_M(p_t, p_t^i) \ge \Delta_{ij} - \xi, \;\; \forall i, j, \qquad (14)$$

where $\xi \ge 0$ and $\Delta_{ij} = s^o(R_t, R_t^i) - s^o(R_t, R_t^j)$. Clearly, the number of constraints in the optimization problem (14) 
is exponentially large or even infinite, making it difficult to optimize. Our approach to this optimization problem differs 
from [13] in four main aspects: i) our approach aims to learn a distance metric while [13] seeks an SVM classifier; 
ii) we optimize an online max-margin objective function while [13] solves a batch-mode optimization problem; iii) 
our optimization problem involves nonlinear constraints on triplet-based Mahalanobis distance differences, while the 
optimization problem in [13] comprises linear constraints on doublet-based SVM classification score differences; and iv) 
our approach directly solves the primal optimization problem while [13] optimizes the dual problem. 
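
The overlap function and the margin term of problem (14) can be sketched as follows. The box format (x, y, width, height) with axis-aligned boxes is an assumption, since the text does not fix a parameterization, and the function names are hypothetical.

```python
def overlap(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def margin(box_t, box_i, box_j):
    """Delta_ij: how much more box i overlaps the tracked box than box j does."""
    return overlap(box_t, box_i) - overlap(box_t, box_j)
```

A positive margin means that box i should be closer to the tracked sample than box j under the learned metric, by at least that margin up to the slack.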

Structured optimization Inspired by the cutting-plane method, we iteratively construct a constraint set (denoted as 
V) containing the most violated constraints for the optimization problem (14). In our case, the most violated constraint 
is selected according to the following criterion: 

(M> v ) = arg max + £> M (Pt, Pt) « D M (Pt, Pt ), (15) 

(fij) 

For notational simplicity, let /m(P£ ° Rt, p J t o R^, pj o RJ) denote the loss term A^ + Dm.{pu Pt) ~ T>m(p u Pt)- Note 
that the violated constraints generated from (15) are used if and only if lm(Pt ° Rt,Pt ° R^,pJ ° RJ) is greater than 
zero. Subsequently, we add the most violated constraint to the optimization problem (14) in an iterative manner, that is, 
V <— V |J{(pf ° R*\ Pt ° Rt )}• The corresponding Lagrangian is formulated as: 

c=\ ||M - M fe H! + (C- P)Z + ETJiM - £ + D M (p uP n - D m ( p*,pD], (16) 

where /3 > 0 and rp > 0 are Lagrange multipliers. The optimization procedure is once again carried out in two alternating 
steps: 

• Update M. By setting to zero, we obtain an updated M defined as: 

\v\ 

M k+1 =M k + J2 VtUi (17) 

1=1 

where = a^(a^) T — aJ^(af £ ) T , and aj 1 denotes Pt — Pt- 

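• Update $\eta$. To obtain the optimal solution for all Lagrange multipliers $\eta_\ell$, we set $\partial \mathcal{L} / \partial \eta_\ell = 0$ for all $\ell$, leading to the
following optimization problem:

$$\eta^{*} = \arg\min_{\eta} \|B\eta - f\|_{1}, \quad \text{s.t. } \eta \succeq 0; \; \mathbf{1}^{T}\eta \leq C, \qquad (18)$$

where $\eta = (\eta_1, \eta_2, \ldots, \eta_{|V|})^{T}$, $f = (f_1, f_2, \ldots, f_{|V|})^{T}$ with $f_\ell$ being $-[\Delta_{i_\ell j_\ell} + a_{i_\ell}^{T} M^{k} a_{i_\ell} - a_{j_\ell}^{T} M^{k} a_{j_\ell}]$,
and $B = (b_{\ell m})_{|V| \times |V|}$ with $b_{\ell m}$ being $\mathbf{1}^{T}(U_\ell \circ U_m)\mathbf{1} + a_{i_\ell}^{T} U_m a_{i_\ell} - a_{j_\ell}^{T} U_m a_{j_\ell} + a_{i_m}^{T} U_\ell a_{i_m} - a_{j_m}^{T} U_\ell a_{j_m}$.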
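
The selection of the most violated constraint and the metric update of Equ. (17) can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: it takes the squared Mahalanobis form $D_M(x, y) = (x - y)^T M (x - y)$, receives the margins $\Delta_{ij}$ as a precomputed matrix `delta`, and uses hand-picked multipliers instead of solving problem (18). The function names are hypothetical and do not come from the paper.

```python
import numpy as np


def mahalanobis_sq(M, x, y):
    """Squared Mahalanobis distance (x - y)^T M (x - y) (assumed form of D_M)."""
    d = x - y
    return float(d @ M @ d)


def most_violated_pair(M, p, candidates, delta):
    """Return (i, j, loss) maximizing delta[i, j] + D_M(p, c_i) - D_M(p, c_j)."""
    dist = np.array([mahalanobis_sq(M, p, c) for c in candidates])
    loss = delta + dist[:, None] - dist[None, :]
    np.fill_diagonal(loss, -np.inf)  # a constraint needs two distinct candidates
    i, j = np.unravel_index(np.argmax(loss), loss.shape)
    return int(i), int(j), float(loss[i, j])


def update_metric(M_k, p, candidates, pairs, eta):
    """Return M_k + sum_l eta_l * U_l with U_l = a_j a_j^T - a_i a_i^T."""
    M = M_k.copy()
    for (i, j), eta_l in zip(pairs, eta):
        a_i = p - candidates[i]
        a_j = p - candidates[j]
        M += eta_l * (np.outer(a_j, a_j) - np.outer(a_i, a_i))
    return M
```

In this sketch a pair returned by `most_violated_pair` would be added to the constraint set only when its loss is positive, mirroring the rule stated above; with a small positive multiplier the update never increases the loss of the selected pair.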

As before, the optimal M is updated as a sequence of rank-one additions: $M \leftarrow M + \eta_\ell [a_{j_\ell} a_{j_\ell}^{T} - a_{i_\ell} a_{i_\ell}^{T}]$. As
a result, the original $P^{T} M P$ becomes $P^{T} M P + (\eta_\ell P^{T} a_{j_\ell})(P^{T} a_{j_\ell})^{T} + (-\eta_\ell P^{T} a_{i_\ell})(P^{T} a_{i_\ell})^{T}$. When M is modified
by a rank-one addition, the inverse of $P^{T} M P$ can again be updated according to the theory of [44], [45].
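
As an illustration of how such an inverse can be refreshed without a full re-inversion, the sketch below applies the Sherman-Morrison identity twice, once per rank-one term. That this identity is what the cited theory amounts to is an assumption made here, and the helper names are hypothetical.

```python
import numpy as np


def sherman_morrison(A_inv, u, v):
    """Return the inverse of (A + u v^T) given A_inv = A^{-1}."""
    Au = A_inv @ u
    vA = v @ A_inv
    denom = 1.0 + v @ Au
    if abs(denom) < 1e-12:
        raise np.linalg.LinAlgError("rank-one update makes the matrix singular")
    return A_inv - np.outer(Au, vA) / denom


def update_gram_inverse(G_inv, P, a_i, a_j, eta):
    """Update G_inv = (P^T M P)^{-1} after M <- M + eta (a_j a_j^T - a_i a_i^T)."""
    b_j = P.T @ a_j
    b_i = P.T @ a_i
    G_inv = sherman_morrison(G_inv, eta * b_j, b_j)   # + (eta P^T a_j)(P^T a_j)^T
    G_inv = sherman_morrison(G_inv, -eta * b_i, b_i)  # + (-eta P^T a_i)(P^T a_i)^T
    return G_inv
```

Each call costs a few matrix-vector products on the small matrix $P^{T} M P$, which is the point of expressing the metric update as rank-one additions.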

Algorithm 3 outlines the procedure of online structured metric learning. The complete derivation of these results is 
given in the supplementary file (as shown in Sec. VIII). 


Fig. 2: Intuitive illustration of time-weighted reservoir sampling. The upper part corresponds to the foreground samples stored in the foreground
buffer during tracking, and the lower part is associated with all the foreground samples collected in the entire tracking process. Clearly, time-weighted
reservoir sampling encourages more recent samples to appear in the buffer, while retaining some old samples with a long lifespan.


F. Time-weighted reservoir sampling 

In order to construct a metric-weighted linear representation (referred to in Equ. (1)) for visual tracking, we need to
learn a discriminative Mahalanobis metric matrix M by minimizing the total ranking losses (referred to in Equ. (8))
over a set of training triplets $\mathcal{Q} = \{(p, p^{+}, p^{-})\}$, which are generated from the training data (collected incrementally
frame by frame) by proximity comparison. As tracking proceeds, the amount of collected training data increases, which
leads to a rapid, combinatorial growth of the triplet set size (i.e., $|\mathcal{Q}|$). As a result, the optimization required for metric learning
(referred to in Equ. (8)) becomes computationally intractable. To address this issue, a practical solution is to maintain
a limited-sized buffer to store only selected training triplets. However, using the training triplets from the limited-sized
buffer (instead of all training data) for metric learning usually leads to a loss of discriminative information. Reducing this
information loss effectively is therefore our focus here.

Inspired by the idea of reservoir sampling [33], [34], [46], [47] (i.e., sequential random sampling for statistical learning),
we propose a sampling scheme to maintain and update the limited-sized buffer while preserving the discriminative
information on the ranking losses as much as possible. Moreover, since the training data for tracking have to be collected
frame by frame, the limited-sized buffer needs to be updated sequentially. Therefore, we seek a sequential sampling
mechanism that maintains and updates the buffer online, in such a way that the ranking losses for metric learning are as close
as possible to those obtained using all the received training samples. Reservoir sampling is one approach to this problem.

The classical version of reservoir sampling simulates the process of uniform random sampling [33], [34] from a large 
population of sequential samples. However, this is inappropriate for visual tracking because the samples are dynamically 
distributed as time progresses. Usually, recent samples should have a greater influence on the current tracking process 
than those appearing a long time ago. Therefore, larger weights should be assigned to recent samples while smaller 
weights should be attached to old samples. Based on weighted reservoir sampling [46], [47], our sampling scheme further
takes into account the time-varying properties of visual tracking, by incorporating time-related weight information into 
the weighted reservoir sampling process. 

More specifically, we design a time-weighted reservoir sampling (TWRS) method for randomly drawing samples
according to their time-varying properties, as listed in Algorithm 4. In the algorithm, each new sample p is associated with
a time-related weight $w = q^{I}$, with $I$ being the frame index number corresponding to p and $q > 1$ being fixed. Using
this time-related weight, a random key for indexing the new sample is generated by $k = u^{1/w}$ with $u \sim \mathrm{rand}(0, 1)$.
After that, a weighted sampling procedure [47] is adopted to update the existing foreground or background sample buffer.
Fig. 2 gives an intuitive illustration of the way that TWRS retains useful old samples while keeping sample adaptability
Algorithm 4 Time-weighted reservoir sampling

Input: Current buffers $B_f$ and $B_b$ together with their corresponding keys, a new training sample p, maximum buffer size $\Omega$,
time-weighted factor q.

Output: Updated buffers $B_f$ and $B_b$ together with their corresponding keys.

1) Obtain the samples $p_f^{*} \in B_f$ and $p_b^{*} \in B_b$ with the smallest keys $k_f^{*}$ and $k_b^{*}$ from $B_f$ and $B_b$, respectively.

2) Compute the time-related weight $w = q^{I}$ with $I$ being the corresponding frame index number of p.

3) Calculate a key $k = u^{1/w}$ where $u \sim \mathrm{rand}(0, 1)$.

4) Case: p ∈ foreground. If $|B_f| < \Omega$, $B_f = B_f \cup \{p\}$; otherwise, $p_f^{*}$ is replaced with p when $k > k_f^{*}$.
Case: p ∈ background. If $|B_b| < \Omega$, $B_b = B_b \cup \{p\}$; otherwise, $p_b^{*}$ is replaced with p when $k > k_b^{*}$.

5) Return $B_f$ and $B_b$ together with their corresponding keys.


to recent changes. Note that this is the first time that time-weighted reservoir sampling has been used for visual tracking.
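
The steps of Algorithm 4 can be sketched as a small Python class, with one instance per buffer (foreground and background). The class and method names are hypothetical. One implementation choice here is an assumption rather than part of the paper: instead of computing $k = u^{1/w}$ directly, the sketch ranks samples by $-\log(-\log k) = I \log q - \log(-\log u)$, which is an increasing function of $k$ and therefore gives the same ordering while avoiding overflow of $q^{I}$ for large frame indices.

```python
import heapq
import math
import random


class TimeWeightedReservoir:
    """Fixed-size buffer that keeps the samples with the largest random keys."""

    def __init__(self, max_size, q, seed=None):
        if max_size <= 0 or q <= 1.0:
            raise ValueError("need max_size > 0 and q > 1")
        self.max_size = max_size
        self.log_q = math.log(q)
        self.rng = random.Random(seed)
        self.heap = []     # min-heap of (score, insertion_order, sample)
        self.inserted = 0  # tie-breaker so samples themselves are never compared

    def _score(self, frame_index):
        # Key: k = u**(1/w) with w = q**frame_index. The score below equals
        # -log(-log k), so a larger score means a larger key.
        u = self.rng.random()
        while u <= 0.0:  # random() lies in [0, 1); exclude 0 before taking logs
            u = self.rng.random()
        return frame_index * self.log_q - math.log(-math.log(u))

    def offer(self, sample, frame_index):
        """Insert the sample if there is room or its key beats the smallest key."""
        entry = (self._score(frame_index), self.inserted, sample)
        self.inserted += 1
        if len(self.heap) < self.max_size:
            heapq.heappush(self.heap, entry)
            return True
        if entry[0] > self.heap[0][0]:
            heapq.heapreplace(self.heap, entry)  # drop the smallest-key sample
            return True
        return False

    def samples(self):
        return [sample for _, _, sample in self.heap]
```

A tracker would keep two such objects, for example `TimeWeightedReservoir(300, 1.6)` for each of the foreground and background buffers, and call `offer(sample, frame_index)` for every newly collected training sample.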

As described in the supplementary file, Theorem IX.1 states the relationship between the ranking losses computed
from the reservoir sampling-based buffers and those computed from all training data seen to date. This theorem shows that the sum of the
ranking losses $\{l_M(p, p^{+}, p^{-})\}$ over the foreground and background buffers is probabilistically close to the sum of the
empirical ranking losses $\{l_M(p, p^{+}, p^{-})\}$ over the received training data.

Therefore, statistical learning based on our reservoir sampling method with limited-sized sample buffers can effectively 
approximate statistical learning using all the received training samples. In our case, reservoir sampling is used to maintain 
and update the foreground and background basis samples for discriminative distance metric learning. Hence, the TWRS 
method extends the reservoir sampling method [47] to cope with the online metric learning problem using two sample 
buffers. It is a version of reservoir sampling tailored for online triplet-based metric learning during visual tracking. 

The key benefit of TWRS is that it effectively generates and maintains limited-sized sample buffers, which favor
recent samples while retaining some old samples with a long lifespan. In this way, online metric learning using the
limited-sized sample buffers approximates batch-mode learning (i.e., retaining all the samples during tracking),
balancing effectiveness and efficiency. The key difference from other online approaches (e.g., those using forgetting
factors) is that the total learning costs for TWRS are derived from the sequentially generated limited-sized sample buffers
(retaining the old and recent samples simultaneously). In contrast, online learning approaches using forgetting factors
only store the recent samples and discard the previously generated samples; the costs associated with the previously generated
samples decay recursively according to the forgetting factor. As a result, such approaches
may suffer from the model drift problem.

IV. Experimental evaluation of our baseline tracker 
A. Experimental configurations 

1) Implementation details. For the sake of computational efficiency, we simply consider the object state information in
2D translation and scaling in the particle filtering module, where the corresponding variance parameters are set to (10,
10, 0.1). The number of particles is set to 200. In practice, the video sequences used for the experiments mostly contain
targets with relatively slow motion and progressive scale variation. As a result, these settings for the particle
number and translational variances are adequate for robust visual tracking. Of course, in the case of fast or
drastic object motion, larger variance parameters or more particles should be adopted. For
each particle, there is a corresponding image region represented as a HOG feature descriptor [48] with 3x3 cells (each
cell is represented by a 9-dimensional histogram vector) in the five spatial block-division modes [49], resulting in a
405-dimensional feature vector (3 x 3 x 9 x 5 = 405) for the image region. The number of triplets used for online metric learning is chosen as
500. The maximum buffer size $\Omega$ and time-weighted factor q in Algorithm 4 are set to 300 and 1.6, respectively. Similarly
to [16], we take a spatial distance-based strategy for training sample selection. The scaling factors $\gamma_f$ and $\gamma_b$ in Equ. (3)
are chosen as 1. The trade-off control factor $\rho$ in Equ. (3) is set to 0.1. Note that the aforementioned parameters are
fixed throughout all the experiments. Implemented in Matlab on a workstation with an Intel
Core 2 Duo 2.66GHz processor and 3.24G RAM, the proposed tracker takes about 0.55 seconds
per frame on average (because of the slow for-loop operations in Matlab). In contrast, when the proposed tracker is implemented in C++
with multi-thread parallel computation (which greatly speeds up the for-loop operations), the running time is
reduced to about 0.08 seconds per frame.
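
For concreteness, the particle settings above can be illustrated with a minimal NumPy sketch of the state sampling step. The Gaussian random-walk motion model, the additive treatment of scale, the reading of (10, 10, 0.1) as variances of (x, y, scale), and the function name are all assumptions made for illustration; the paper does not spell out these details in this section.

```python
import numpy as np


def propagate_particles(state, n_particles=200, variances=(10.0, 10.0, 0.1), rng=None):
    """Draw particles around state = (x, y, scale) with a Gaussian random walk."""
    rng = np.random.default_rng() if rng is None else rng
    std = np.sqrt(np.asarray(variances, dtype=float))
    noise = rng.normal(0.0, std, size=(n_particles, 3))
    particles = np.asarray(state, dtype=float) + noise
    particles[:, 2] = np.maximum(particles[:, 2], 1e-3)  # keep the scale positive
    return particles
```

Larger variances or a larger `n_particles` widen the search region, which corresponds to the adjustment suggested above for fast or drastic object motion.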

2) Datasets and evaluation criteria. A set of experiments is conducted on eighteen challenging video sequences,
which consist of 8-bit grayscale images. These video sequences are captured from different scenes, and contain different

Fig. 3: Quantitative evaluation of the proposed tracker using different buffer sizes and particle numbers. The left half corresponds to the tracking 
results with different buffer sizes, while the right half is associated with the tracking results with different particle numbers. 
Method | CLE | VOR | Success rate
… | … 7.91 … 14.83 … | 0.27 0.67 0.18 0.46 0.07 0.51 | 0.30 0.89 0.25 0.28 0.08 0.62
CS+pixels | 68.45 5.51 … 53.56 78.98 39.75 | 0.19 0.68 0.37 0.22 0.17 0.17 | 0.19 0.90 0.47 0.24 0.09 0.11
L1+pixels | 27.64 160.79 24.26 64.07 108.64 12.76 | 0.28 0.08 0.41 0.16 0.09 0.52 | 0.34 0.10 0.49 0.16 0.06 0.61


TABLE I: Quantitative evaluation of the proposed tracker using different linear representations on four video sequences. The table shows their 
average CLEs, VORs, and success rates. 


types of object motion events (e.g., human walking and car running), which are illustrated in the supplementary file. For
quantitative performance comparison, two popular evaluation criteria are introduced, namely, center location error (CLE)
and VOC overlap ratio (VOR) between the predicted bounding box $B_p$ and ground truth bounding box $B_{gt}$ such that
$\mathrm{VOR} = \frac{\mathrm{area}(B_p \cap B_{gt})}{\mathrm{area}(B_p \cup B_{gt})}$. If the VOC overlap ratio is larger than 0.5, then tracking is considered successful in that frame.
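
The three criteria can be computed as in the following sketch. It assumes axis-aligned boxes given as (x, y, width, height) tuples and measures CLE as the Euclidean distance between box centers; the box format and function names are illustrative choices, not taken from the paper.

```python
import math


def center_location_error(box_p, box_gt):
    """Euclidean distance between the centers of two (x, y, w, h) boxes."""
    cx_p, cy_p = box_p[0] + box_p[2] / 2.0, box_p[1] + box_p[3] / 2.0
    cx_g, cy_g = box_gt[0] + box_gt[2] / 2.0, box_gt[1] + box_gt[3] / 2.0
    return math.hypot(cx_p - cx_g, cy_p - cy_g)


def voc_overlap_ratio(box_p, box_gt):
    """area(B_p intersect B_gt) / area(B_p union B_gt) for (x, y, w, h) boxes."""
    iw = max(0.0, min(box_p[0] + box_p[2], box_gt[0] + box_gt[2]) - max(box_p[0], box_gt[0]))
    ih = max(0.0, min(box_p[1] + box_p[3], box_gt[1] + box_gt[3]) - max(box_p[1], box_gt[1]))
    inter = iw * ih
    union = box_p[2] * box_p[3] + box_gt[2] * box_gt[3] - inter
    return inter / union if union > 0 else 0.0


def success_rate(predictions, ground_truth, threshold=0.5):
    """Fraction of frames whose overlap ratio exceeds the threshold."""
    hits = sum(voc_overlap_ratio(p, g) > threshold for p, g in zip(predictions, ground_truth))
    return hits / len(predictions)
```

For example, a 10x10 box shifted horizontally by 5 pixels against its ground truth has an intersection of 50 and a union of 150, giving a VOR of 1/3 and a CLE of 5, so that frame would not count as a success.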


B. Empirical analysis of parameter settings 

1) Sample buffer size. To test the effect of buffer size on reservoir sampling, we compute the average CLE and VOR 
for each video sequence using nine different sample buffer sizes. Figs. 3 (a) and (b) show the quantitative CLE and VOR 
performance on five video sequences. It is clear that the average CLE (VOR) decreases (increases) as the buffer size 
increases, and plateaus once the buffer holds roughly 300 samples or more.

2) Number of particles. In general, more particles enable visual trackers to locate the object more accurately, but lead to 
a higher computational cost. Thus, it is crucial for visual trackers to keep a good balance between accuracy and efficiency 
using a moderate number of particles. Fig. 3 (c) shows the average VOC success rates (i.e., $\frac{\#\text{successful frames}}{\#\text{total frames}}$) of the
proposed tracking algorithm on three video sequences. From Fig. 3 (c), we can see that the success rate grows rapidly
with increasing particle number and then converges at approximately 200-300 particles for each sequence. Fig. 3 (d)
displays the average CPU time (spent by the proposed tracking algorithm in each frame) with different particle numbers.
It is observed from Fig. 3 (d) that the average CPU time increases slowly with the number of particles.

3) Comparison of different linear representations. The objective of this task is to evaluate the performance of four 
linear representations: our linear representation with metric learning, our linear representation without metric learning, 
compressive sensing linear representation [7], and $\ell_1$-regularized linear representation [6]. For a fair comparison, we
utilize the raw pixel features as in [7], [6]. Tab. I shows the average performance of these four linear representations in 
CLE, VOR, and success rate on four video sequences. Clearly, our linear representation with metric learning consistently 
achieves better tracking results than the three other linear representations. Please see the supplementary file for the details 
of the frame-by-frame tracking results (i.e., CLE, VOR, success rate). 

4) Evaluation of different sampling methods. Here, we examine the performance of two sampling methods: uniform [33]
and time-weighted [47] reservoir sampling. Tab. II shows the experimental results of the two sampling methods in CLE,
VOR, and success rate on five video sequences. From Tab. II, we can see that time-weighted reservoir sampling performs
better than uniform reservoir sampling. More results for these two sampling methods can be found in the supplementary
file.

5) Performance with and without metric learning. To assess the effect of different metric learning mechanisms,
we design several experiments on five video sequences. Tab. III shows the corresponding experimental results of
different metric learning mechanisms in CLE, VOR, and success rate. From Tab. III, we can see that the performance
of metric learning is better than that of no metric learning. In addition, the performance of metric learning with no
Fig. 4: Quantitative comparison of the proposed tracker with weighted reservoir sampling and batch mode learning in average VOR on three video
sequences. Clearly, the tracking performance of our weighted reservoir sampling is very close to that of batch mode learning. 



Fig. 5: Quantitative evaluation of the proposed tracker with five different initialization configurations (obtained by moderate random perturbation on 
the original initialization setting) in VOR on three video sequences. It is clear that the proposed tracker is not very sensitive to different initialization 
configurations. 


eigendecomposition is close to that of metric learning with step-by-step eigendecomposition, and better than that of
metric learning with final eigendecomposition. These results are consistent with those in [22]. Besides,
metric learning with step-by-step eigendecomposition is much slower than metric learning with no eigendecomposition, which is
the variant adopted by the proposed tracking algorithm.

6) Evaluation of weighted reservoir sampling and batch mode learning. To balance effectiveness and efficiency, 
weighted reservoir sampling aims to maintain limited-sized foreground and background sample buffers used for learning a 
metric-weighted linear representation. In contrast, batch mode learning requires storing all the foreground and background 
samples during tracking, which leads to expensive computation and high memory usage. Therefore, we conduct a 
quantitative comparison experiment between weighted reservoir sampling and batch mode learning on three video 
sequences, as shown in Fig. 4. From Fig. 4, we observe that the tracking performance with weighted reservoir sampling
closely approximates that of batch mode learning.

7) Effect of random initialization perturbation. Here, we aim to investigate the tracking performance of the proposed 
tracker with different initialization configurations, which are generated by moderate random perturbation (i.e., relatively 
small center location offset) on the original bounding box after manual initialization. Fig. 5 shows the average VOR 
tracking performance on three video sequences in different initialization cases. Fig. 5 shows that the
proposed tracker achieves similar tracking results across these cases, and is not sensitive to the initialization setting.

8) Investigation of the trade-off control factor $\rho$. To evaluate the effect of the discriminative metric-weighted reconstruction
information from foreground and background buffers, we make a quantitative empirical study of the proposed
tracker with different configurations of $\rho$ (referred to in Equ. (3)). Fig. 6 displays the quantitative average VOR tracking
results on three video sequences using different configurations of $\rho$ such that $\rho \in \{0.07, 0.08, 0.10, 0.12, 0.14, 0.16, 0.19\}$.
Apparently, the proposed tracker is not very sensitive to the configuration of $\rho$ within a moderate range.


C. Comparison with the state-of-the-art trackers 

To demonstrate the effectiveness of the proposed tracking algorithm, we make a qualitative and quantitative comparison 
with several state-of-the-art trackers, referred to as FragT (Fragment-based tracker [50]), MILT (multiple instance 
boosting-based tracker [16]), VTD (visual tracking decomposition [4]), OAB (online AdaBoost [15]), IPCA (incremental 
Fig. 6: Quantitative evaluation of the proposed tracker with different settings for the trade-off control factor $\rho$ such that $\rho \in
\{0.07, 0.08, 0.10, 0.12, 0.14, 0.16, 0.19\}$. It is observed that the proposed tracker is not very sensitive to the setting of $\rho$.
… 0.75 0.74 0.67 0.61 0.68 0.90 0.99 0.98 0.93 0.58 0.88


TABLE II: Quantitative evaluation of the proposed tracker using different sampling methods on five video sequences. The table shows their average 
CLEs, VORs, and success rates. 

PCA [3]), L1T ($\ell_1$ minimization tracker [6]), CT (compressive tracker [10]), Struck (structured learning tracker [13]),
DML (discriminative metric learning tracker [19]), TLD (tracking-learning-detection [51]), ASLA (adaptive structural
local sparse model [52]), and SCM (sparsity-based collaborative model [53]). Moreover, the proposed tracking algorithm
has two versions, unstructured and structured (referred to as Ours and Ours+S, respectively).

In the experiments, the following trackers are implemented using their publicly available source code: FragT, MILT,
VTD, OAB, CT, Struck, IPCA, L1T, TLD, ASLA, and SCM. For OAB, there are two different versions (namely, OAB1
and OAB5), which are based on two different configurations (i.e., the search scale r = 1 and r = 5 as in [16]).

Figs. 7-12 show the qualitative tracking results of the eleven trackers on several sample frames from six video sequences.
Fig. 13 and Tab. IV report the quantitative tracking results of the eleven trackers (in CLE, VOR, and success rate) over
several video sequences. The complete tracking results and quantitative comparisons for all eighteen video sequences
can be found in the supplementary file. From Fig. 13 and Tab. IV, we observe that the proposed tracking algorithm
achieves the best tracking performance by all measures on most video sequences. In the experiments, the TLD tracker
(using the default parameter settings) produces incomplete tracking results on some video sequences because of
its particular tracking-learning-detection properties (i.e., tracking reliability analysis by simultaneously performing object
detection and optical flow-based verification). Therefore, we only show the video sequences in which the TLD tracker
achieves stable tracking performance for all the frames. The reasons for the incomplete tracking results are briefly
analyzed as follows. In principle, the TLD tracker takes a tracking-by-detection strategy that needs to simultaneously
perform object detection as well as optical flow-based tracking verification across successive frames. In the case of severe
occlusions (or drastic pose changes, tiny objects, or strong background clutter), it adaptively evaluates the tracking
reliability by performing optical flow-based tracking verification or object classification (whose classification score may
be very low), and is likely to remove the unreliable tracking results, leaving no tracking output over several
frames. Once the tracked object becomes visually detectable again, the detection component of the TLD tracker is
automatically activated to re-localize it.

Subsequently, we briefly analyze the reasons why our tracker works well in some challenging situations. In essence, the
foreground samples stored in the buffer approximately constitute an object manifold that contains the intrinsic structural
information on object appearance. After distance metric learning, the object manifold encodes more discriminative
information on object/non-object classification. If test samples are contaminated by complicated factors (e.g., shape
deformation, noise corruption, and illumination variation), the intrinsic manifold structure of object appearance
helps to recover these test samples from contamination by manifold embedding (i.e., metric-weighted linear
regression). Moreover, time-weighted reservoir sampling ensures that the sample buffer retains useful old samples
with a long lifespan while adapting to recent appearance changes. Therefore, metric-weighted linear regression on
such a sample buffer can not only alleviate the tracker drift problem but also adapt to complicated appearance changes.
Besides, metric-weighted linear regression on the background buffer can generate the discriminative information to help
… 0.67 … 0.78 … 0.75 … 0.99 0.99 0.98 0.98 0.98 0.97 … 0.68 … 0.67 0.63 0.51 0.64 0.92 0.97 0.88 0.91 0.40 0.85


TABLE III: Quantitative evaluation of the proposed tracker with different metric learning configurations on five video sequences. The table reports 
their average tracking results in CLE, VOR, and success rate. 



Fig. 7: Tracking results of different trackers over some representative frames from the “Lola” video sequence in the scenarios with drastic scale 
changes and body pose variations. 

the tracker reject several false foreground samples (caused by occlusion, out-of-plane rotation, and pose variation). Finally,
the features used during tracking are extracted in a block-division manner. Therefore, they are capable of encoding the
local geometrical information on object appearance, leading to robustness in complicated scenarios (e.g., partial
occlusion). Of course, when the appearance distinction between the foreground and background samples (especially for
background clutter) is small, discriminative distance metric learning cannot improve the performance of metric-weighted
linear regression. In this case, our tracker cannot accurately capture the target location (e.g., the “coke11”
sequence shown in Tab. IV) or may even fail to track.

Moreover, time-weighted reservoir sampling can alleviate error accumulation during tracking. Although some false
foreground/background samples may be added to the buffers because of tracking errors, the old samples with a long
lifespan can effectively reduce the influence of the false foreground/background samples on metric-weighted least squares
regression, leading to robust tracking results. In other words, the metric-weighted least squares regression problem has
two types of reconstruction costs. One is based on the old samples with a long lifespan, and the other relies on the recent
samples. The regression cost for the old samples with a long lifespan works as a regularizer that can resist tracker
drift. In addition, metric-weighted linear regression on the background buffer can generate the discriminative information
to help the tracker reject false foreground samples (caused by occlusion, out-of-plane rotation, and pose variation). For
instance, some false foreground training samples are included in the foreground sample buffer at the 18th frame of the
“seq-jd” video sequence because of partial occlusions, as shown in the supplementary demo video files. After the 19th
frame (without occlusions), our tracker is still able to accurately localize the head target with the help of discriminative
metric-weighted linear regression on the foreground and background sample buffers.

Fig. 14 shows a failure case for our method on the “coke11” video sequence. As shown in Fig. 14(a), the appearance
difference between the tracked object and its surrounding background is relatively small (i.e., they appear to be visually
edgeless or textureless). As a result, the foreground and background metric-weighted linear reconstruction costs for these
regions with respect to a set of foreground or background basis samples are mutually close, resulting in a low confidence
score for discriminative object/non-object classification with several false detection hypotheses. Based on structured
SVM learning for optimizing the structural localization measure, Struck is capable of encoding the joint feature-location
Fig. 8: Tracking results of different trackers over some representative frames from the “iceball” video sequence in the scenarios with partial
occlusions, out-of-plane rotations, body pose variations, and abrupt motion.

Fig. 9: Tracking results of different trackers over some representative frames from the “football3” video sequence in the scenarios with motion
blurring, partial occlusions, head pose variations, and background clutters.


contextual information on object appearance, leading to a relatively stable tracking result. As shown in Fig. 14(b), a severe 
occlusion event leads to the almost complete disappearance of the tracked object. Consequently, both of the trackers fail 
to track in this scenario. 

Potentially, further performance improvements can be made in the following two respects: 1) The current versions
of the trackers are still based on several hand-crafted visual features (e.g., HOG and Haar), which are weak in adaptively
capturing the intrinsic discriminative appearance properties of the tracked objects in various scenarios. Therefore, adaptive
online feature learning is a potential solution to this issue. 2) The integration of tracking reliability analysis could handle
some abnormal events like severe occlusions: the detectors or classifiers would be updated if and only if the tracking
results are very reliable.

The most competitive method to ours is Struck [13], which achieves a comparable or better tracking performance on 
five of the eighteen sequences. Struck is based on structured learning, and directly optimizes the VOC overlap criterion 
using a structured SVM formulation. In the following section, we extend our framework to optimize this same criterion 
and re-evaluate against Struck. 
Fig. 10: Tracking results of different trackers over some representative frames from the “planeshow” video sequence in the scenarios with shape
deformations, out-of-plane rotations, and pose variations. 



Fig. 11: Tracking results of different trackers over some representative frames from the “race” video sequence in the scenarios with background 
clutters. 

D. Empirical evaluation of structured metric learning 

To evaluate the effect of structured metric learning, we compare its tracking performance to our previous method on 
eight video sequences. For computational efficiency, the structured metric learning method takes a uniform sampling 
strategy to randomly generate a collection of bounding boxes around the current tracker bounding box. Using these 
bounding boxes, we construct a set of triplet-based structural constraints (referred to in Equ. (14)) for online structured 
metric learning. 
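
The candidate generation step described above can be illustrated as follows. The sketch assumes translation-only offsets drawn uniformly from a square window of half-width `radius` around the current box, boxes in (x, y, width, height) format, and the VOC overlap with the current box as the score $s^{o}$; all three choices, as well as the function names, are assumptions for illustration rather than details confirmed by the paper.

```python
import numpy as np


def overlap(a, b):
    """VOC overlap ratio of two (x, y, w, h) boxes."""
    iw = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def sample_candidate_boxes(box, n, radius, rng=None):
    """Uniformly sample n boxes of the same size shifted within +/- radius pixels."""
    rng = np.random.default_rng() if rng is None else rng
    boxes = np.tile(np.asarray(box, dtype=float), (n, 1))
    boxes[:, :2] += rng.uniform(-radius, radius, size=(n, 2))
    return boxes


def margin_matrix(box, candidates):
    """delta[i, j] = s(box, candidate_i) - s(box, candidate_j)."""
    s = np.array([overlap(box, c) for c in candidates])
    return s[:, None] - s[None, :]
```

The resulting matrix plays the role of the margins $\Delta_{ij}$ in the triplet-based structural constraints: a candidate that overlaps the current box more is required to be closer to it in the learned metric than one that overlaps less.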

Tab. V reports their average frame-by-frame VORs, CLEs, and success rates on the four video sequences. Tab. V shows
that the structured metric learning method outperforms both the non-structured metric learning method
and the method without metric learning.

In addition, Fig. 15 shows the average runtime performance of the tracking approach using non-structured or structured
metric learning on the eight video sequences. According to Fig. 15, structured metric learning is about 20
Fig. 12: Tracking results of different trackers over some representative frames from the “trellis70” video sequence in the scenarios with drastic
illumination changes and head pose variations. 


times slower than non-structured metric learning. From Tab. V, we see that non-structured metric learning achieves
tracking performance reasonably close to that of structured metric learning, while being computationally much more
efficient. Therefore, in the applications presented
in the next section, we use non-structured metric learning.

The Struck tracker constructs an object localization scoring function based on structured SVM learning, which learns a
linear SVM scoring function in a max-margin structured output optimization framework. Therefore, the tracking accuracy
of the Struck tracker depends solely on the learned SVM scoring function. In the case of tracking errors, the learned
SVM scoring function is contaminated, and the errors accumulate across frames, which often leads to tracker
drift in complicated scenarios. In contrast, our tracker takes a data-driven strategy for object localization. Namely, our
tracker takes advantage of reservoir sampling to effectively maintain the foreground/background buffers, which store
the recently included samples while keeping the old samples with a long lifespan. Besides, our metric-weighted linear
regression cost based on these old samples essentially works as a regularizer that reduces the influence of accumulated
tracking errors, resulting in robust tracking. Combined with structured metric learning, our tracker is
capable of performing robust visual tracking in a more discriminative metric space, leading to further performance
improvements.


E. Experimental summary 

Based on the obtained experimental results, we observe that the proposed tracking algorithm has the following
properties. First, after the buffer size exceeds a certain value (around 300 in our experiments), the tracking performance
is stable with increasing buffer size, as shown in Fig. 3. This is desirable since we do not need a large buffer size
to achieve promising performance. Second, in contrast to many existing particle filtering-based trackers whose running
time is typically linear in the number of particles, our method’s running time is sublinear in the number of particles,
as shown in Fig. 3. Moreover, its tracking performance rapidly improves and finally converges to a certain value, as
shown in Fig. 3. Third, based on linear representation with metric learning, it performs better in tracking accuracy, as
shown in Tab. I. Fourth, it utilizes weighted reservoir sampling to effectively maintain and update the foreground and
background sample buffers for metric learning, as shown in Tab. II. Fifth, as shown in Tab. III, the performance of
our metric learning with no eigendecomposition is close to that of computationally expensive metric learning with step-by-step
eigendecomposition. Sixth, compared with other state-of-the-art trackers, it is capable of effectively adapting to
complicated appearance changes in the tracking process by constructing an effective metric-weighted linear representation
with weighted reservoir sampling, as shown in Tab. IV. Last, structured metric learning is capable of improving
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Fig. 13: Quantitative comparison of different trackers in VOR on the first fifteen video sequences. 


the tracking performance in CLE and VOR, as shown in Tab. V. That is because the structured metric learning encodes 
the underlying the structural interaction information on data samples, which plays an important role in robust visual 
tracking. 
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B-Beam 

Lola 

trace 

Walk 

football 

iceball 

coke11 

trellis70 

dograce 

football3 

cubicle 

seq-jd 

girl 

B-Street 

planeshow 

race 

CamSeqOl 

carll 


Ours+S 

3.52 

10.67 

6.23 

5.91 

2.88 

2.75 

5.13 

4.22 

3.12 

5.63 

3.10 

3.02 

7.35 

3.11 

2.89 

6.96 

2.87 

2.62 


Ours 

4.68 

11.08 

8.65 

6.08 

3.14 

3.03 

5.76 

5.62 

3.54 

3.92 

4.31 

4.30 

7.72 

3.31 

3.00 

8.52 

3.00 

2.74 


DML 

22.03 

102.62 

91.42 

30.23 

65.34 

15.94 

30.83 

9.10 

10.57 

89.66 

5.02 

5.47 

20.26 

22.10 

3.28 

7.29 

3.74 

3.79 


FragT 

46.61 

141.00 

19.70 

92.46 

31.95 

13.86 

59.54 

39.86 

8.43 

28.52 

25.54 

5.83 

20.92 

15.29 

10.04 

45.28 

9.20 

65.31 


VTD 

15.82 

142.69 

110.25 

103.31 

9.81 

14.35 

45.66 

51.68 

24.56 

32.88 

46.11 

4.88 

10.97 

7.27 

3.64 

65.38 

7.09 

32.18 


MILT 

19.08 

138.97 

68.00 

37.08 

103.59 

76.09 

17.71 

60.41 

7.97 

9.23 

43.53 

16.24 

39.96 

41.16 

4.99 

25.55 

13.26 

44.79 


OAB1 

31.10 

146.08 

65.14 

36.47 

76.53 

24.64 

57.90 

68.58 

8.51 

6.78 

41.66 

20.42 

51.51 

43.32 

4.18 

33.44 

8.21 

24.73 

CLE 

OAB5 

24.15 

67.16 

66.31 

36.74 

97.16 

37.00 

64.97 

126.36 

18.56 

14.87 

37.06 

11.71 

67.13 

47.71 

11.30 

42.18 

5.85 

9.85 


IPCA 

41.64 

127.88 

80.60 

37.33 

133.67 

41.60 

56.60 

46.13 

6.70 

31.99 

33.61 

24.16 

20.90 

41.69 

28.02 

5.26 

21.22 

2.42 


LIT 

22.60 

139.89 

108.64 

124.09 

103.08 

119.69 

64.70 

27.64 

5.28 

64.07 

24.26 

12.76 

44.12 

46.98 

32.27 

160.79 

37.40 

25.46 


Struck 

28.62 

139.90 

24.15 

10.83 

3.24 

3.76 

5.50 

4.82 

4.54 

4.32 

22.61 

3.32 

7.38 

42.06 

5.77 

7.71 

6.54 

2.09 


CT 

37.22 

23.04 

41.59 

35.94 

106.72 

29.97 

15.83 

50.97 

8.98 

13.50 

45.78 

22.06 

34.10 

9.33 

6.61 

76.20 

11.34 

29.43 


ALSA 

15.73 

30.50 

23.11 

28.14 

3.99 

3.17 

12.98 

4.99 

11.20 

31.23 

25.34 

20.19 

36.87 

6.44 

3.19 

35.41 

10.38 

2.05 


SCM 

16.98 

22.94 

20.35 

31.81 

26.60 

3.56 

94.74 

12.30 

8.68 

36.85 

23.04 

11.66 

7.47 

8.25 

5.18 

33.12 

7.17 

1.92 


TLD 







9.73 

13.11 



19.58 


19.15 




9.10 



Ours+S 

0.78 

0.67 

0.74 

0.73 

0.69 

0.71 

0.65 

0.83 

0.67 

0.72 

0.78 

0.76 

0.78 

0.86 

0.80 

0.82 

0.83 

0.77 


Ours 

0.72 

0.66 

0.70 

0.72 

0.67 

0.69 

0.65 

0.78 

0.66 

0.72 

0.74 

0.72 

0.78 

0.85 

0.79 

0.80 

0.82 

0.76 


DML 

0.43 

0.18 

0.20 

0.54 

0.21 

0.47 

0.36 

0.65 

0.54 

0.36 

0.68 

0.68 

0.60 

0.63 

0.68 

0.78 

0.79 

0.69 


FragT 

0.24 

0.13 

0.51 

0.09 

0.36 

0.44 

0.05 

0.34 

0.54 

0.30 

0.35 

0.67 

0.64 

0.57 

0.64 

0.23 

0.65 

0.08 


VTD 

0.49 

0.06 

0.12 

0.09 

0.50 

0.52 

0.11 

0.33 

0.38 

0.32 

0.20 

0.70 

0.73 

0.67 

0.66 

0.16 

0.74 

0.37 


MILT 

0.41 

0.03 

0.34 

0.50 

0.06 

0.13 

0.35 

0.29 

0.59 

0.55 

0.18 

0.53 

0.39 

0.50 

0.71 

0.34 

0.63 

0.15 


OAB1 

0.34 

0.04 

0.35 

0.46 

0.05 

0.26 

0.04 

0.16 

0.57 

0.63 

0.21 

0.47 

0.34 

0.47 

0.73 

0.28 

0.64 

0.37 

VOR 

OAB5 

0.40 

0.13 

0.36 

0.45 

0.05 

0.21 

0.04 

0.06 

0.35 

0.40 

0.24 

0.46 

0.24 

0.39 

0.58 

0.36 

0.74 

0.43 


IPCA 

0.30 

0.05 

0.26 

0.46 

0.02 

0.19 

0.03 

0.36 

0.65 

0.23 

0.30 

0.45 

0.62 

0.50 

0.51 

0.79 

0.49 

0.78 


LIT 

0.41 

0.07 

0.09 

0.09 

0.06 

0.07 

0.03 

0.28 

0.67 

0.16 

0.41 

0.52 

0.46 

0.50 

0.33 

0.08 

0.27 

0.43 


Struck 

0.35 

0.11 

0.43 

0.65 

0.60 

0.68 

0.66 

0.80 

0.64 

0.72 

0.41 

0.76 

0.78 

0.52 

0.73 

0.66 

0.68 

0.79 


CT 

0.25 

0.30 

0.33 

0.45 

0.03 

0.47 

0.41 

0.22 

0.57 

0.43 

0.17 

0.45 

0.45 

0.75 

0.69 

0.26 

0.63 

0.29 


ALSA 

0.47 

0.39 

0.40 

0.51 

0.57 

0.60 

0.42 

0.78 

0.55 

0.30 

0.24 

0.46 

0.39 

0.76 

0.60 

0.18 

0.66 

0.79 


SCM 

0.36 

0.42 

0.43 

0.50 

0.27 

0.61 

0.11 

0.70 

0.57 

0.29 

0.45 

0.48 

0.76 

0.71 

0.65 

0.23 

0.71 

0.77 


TLD 







0.56 

0.69 



0.48 


0.66 




0.70 



Ours+S 

0.95 

0.81 

0.98 

0.99 

0.90 

0.94 

0.88 

0.99 

0.98 

0.97 

0.98 

0.96 

0.99 

0.98 

0.99 

0.99 

0.99 

1.00 


Ours 

0.94 

0.80 

0.89 

0.98 

0.88 

0.93 

0.87 

0.98 

0.97 

0.97 

0.98 

0.95 

0.99 

0.97 

0.98 

0.99 

0.99 

1.00 


DML 

0.43 

0.18 

0.12 

0.67 

0.20 

0.59 

0.37 

0.90 

0.60 

0.46 

0.82 

0.85 

0.79 

0.74 

0.86 

0.99 

0.99 

0.92 


FragT 

0.25 

0.11 

0.63 

0.09 

0.47 

0.52 

0.05 

0.40 

0.49 

0.32 

0.37 

0.88 

0.79 

0.51 

0.71 

0.22 

0.90 

0.08 


VTD 

0.57 

0.07 

0.11 

0.11 

0.62 

0.70 

0.10 

0.37 

0.47 

0.31 

0.20 

0.79 

0.92 

0.98 

0.78 

0.20 

0.99 

0.43 


MILT 

0.37 

0.03 

0.42 

0.64 

0.07 

0.16 

0.28 

0.34 

0.67 

0.61 

0.22 

0.61 

0.19 

0.58 

0.89 

0.18 

0.87 

0.08 

Success 

Rate 

OAB1 

0.37 

0.01 

0.40 

0.62 

0.05 

0.14 

0.04 

0.13 

0.67 

0.87 

0.22 

0.58 

0.18 

0.58 

0.88 

0.19 

0.80 

0.39 

OAB5 

0.43 

0.08 

0.42 

0.67 

0.05 

0.12 

0.04 

0.03 

0.23 

0.24 

0.22 

0.38 

0.14 

0.53 

0.69 

0.25 

0.91 

0.33 

IPCA 

0.34 

0.06 

0.31 

0.62 

0.01 

0.09 

0.03 

0.38 

0.87 

0.22 

0.25 

0.45 

0.80 

0.58 

0.74 

0.99 

0.60 

1.00 


LIT 

0.43 

0.07 

0.06 

0.11 

0.07 

0.08 

0.04 

0.34 

0.87 

0.16 

0.49 

0.61 

0.57 

0.58 

0.45 

0.10 

0.28 

0.59 


Struck 

0.40 

0.13 

0.37 

0.83 

0.74 

0.95 

0.93 

0.96 

0.83 

0.96 

0.47 

0.97 

0.99 

0.58 

0.89 

0.88 

0.93 

1.00 


CT 

0.27 

0.15 

0.29 

0.64 

0.01 

0.67 

0.35 

0.12 

0.60 

0.27 

0.22 

0.47 

0.37 

0.88 

0.83 

0.20 

0.89 

0.17 


ALSA 

0.46 

0.31 

0.41 

0.68 

0.65 

0.73 

0.40 

0.95 

0.58 

0.31 

0.20 

0.48 

0.37 

0.93 

0.70 

0.20 

0.96 

1.00 


SCM 

0.30 

0.53 

0.42 

0.66 

0.32 

0.78 

0.12 

0.84 

0.61 

0.30 

0.63 

0.49 

0.99 

0.91 

0.83 

0.23 

0.94 

1.00 


TLD 







0.70 

0.81 



0.63 


0.90 




0.96 



TABLE IV: Quantitative comparison results of the fifteen trackers over all the video sequences. The table reports their average CLEs, VORs, and 
success rates over each video sequence. Clearly, our tracker achieves the best tracking performance in most cases. In the experiments, the TLD 
tracker produces the incomplete tracking results over some video sequences because of its particular tracking-learning-detection properties (i.e., 
tracking reliability analysis by simultaneously performing object detection and optical flow-based verification). Therefore, we only show the video 
sequences in which the TLD tracker can always achieve stable tracking performances for all the frames. 



Fig. 14: Comparison between Struck and our tracker (highlighted in different colors). Panel (a) shows a case in which our tracker almost fails while Struck succeeds in localizing the object; panel (b) shows a case in which both trackers lose the object.



| Metric | Method | trellis70 | race | cubicle | football3 | trace | seq-jd |
|---|---|---|---|---|---|---|---|
| CLE | Struck | 4.82 | 7.71 | 22.61 | 4.32 | 24.15 | 3.32 |
| CLE | Non-structured metric learning | 5.62 | 8.52 | 4.31 | 3.92 | 8.65 | 4.30 |
| CLE | Structured metric learning | 4.22 | 6.96 | 3.10 | 5.63 | 6.23 | 3.02 |
| CLE | No metric learning | 7.80 | 15.86 | 4.43 | 4.01 | 15.75 | 5.16 |
| VOR | Struck | 0.80 | 0.66 | 0.41 | 0.72 | 0.43 | 0.76 |
| VOR | Non-structured metric learning | 0.78 | 0.80 | 0.74 | 0.72 | 0.70 | 0.72 |
| VOR | Structured metric learning | 0.83 | 0.82 | 0.78 | 0.72 | 0.74 | 0.76 |
| VOR | No metric learning | 0.73 | 0.69 | 0.70 | 0.68 | 0.56 | 0.68 |
| Success Rate | Struck | 0.96 | 0.88 | 0.47 | 0.96 | 0.37 | 0.97 |
| Success Rate | Non-structured metric learning | 0.98 | 0.99 | 0.98 | 0.97 | 0.89 | 0.95 |
| Success Rate | Structured metric learning | 0.99 | 0.99 | 0.98 | 0.97 | 0.98 | 0.96 |
| Success Rate | No metric learning | 0.93 | 0.99 | 0.96 | 0.95 | 0.62 | 0.94 |

TABLE V: Quantitative evaluation of the proposed tracker using different learning strategies on six video sequences. The table reports their average CLEs, VORs, and success rates across frames.

Fig. 15: Runtime performance of the proposed tracker using different metric learning strategies (non-structured and structured metric learning) on eight video sequences. The figure reports their average running time of performing metric learning across frames.

V. Pedestrian tracking and identification

Recent studies have demonstrated the effectiveness of combining object identification and tracking. To achieve this goal, we first need to localize the object of interest and then assign it to one of the predefined object classes using the object tracking information. Without loss of generality, we suppose there are $K$ object classes in total, which correspond to $K$ static template sample sets $\{\mathbf{P}_k\}_{k=1}^{K}$ (collected before object tracking). After performing object tracking on a video sequence ranging from frame 1 to frame $t$, we obtain the consecutive object observations $\mathbf{y}_{1:t}$ whose object classification scores are denoted as $(\mathcal{S}(\mathbf{y}_1), \ldots, \mathcal{S}(\mathbf{y}_t))$ (as defined in Equ. (3)). In essence, these classification scores reflect the likelihood that the observations are generated from the object of interest. With respect to $\mathbf{P}_k$, we compute a set of reconstruction errors $(g(\mathbf{x}_1^k; \mathbf{M}, \mathbf{P}_k, \mathbf{y}_1), \ldots, g(\mathbf{x}_t^k; \mathbf{M}, \mathbf{P}_k, \mathbf{y}_t))$ such that $\mathbf{x}_t^k = \arg\min_{\mathbf{x}} g(\mathbf{x}; \mathbf{M}, \mathbf{P}_k, \mathbf{y}_t)$. Based on these reconstruction errors, the cumulative distance of $\mathbf{y}_t$ with respect to the $k$-th object class is calculated as:

$$\mathcal{H}_k(\mathbf{y}_t) = \sum_{i=1}^{t} \omega(\mathbf{y}_i)\, g(\mathbf{x}_i^k; \mathbf{M}, \mathbf{P}_k, \mathbf{y}_i), \tag{19}$$

where $\omega(\mathbf{y}_i)$ is a weighting factor that measures the prior weight of $\mathbf{y}_i$ being generated from the object of interest. Here, we simply use the object classification score $\mathcal{S}(\mathbf{y}_i)$ to approximate the prior weight $\omega(\mathbf{y}_i)$ in the process of object identification (such that $\omega(\mathbf{y}_i) \propto \mathcal{S}(\mathbf{y}_i)$). As a result, the object class membership for $\mathbf{y}_t$ is determined by $k^{*} = \arg\min_{1 \le k \le K} \mathcal{H}_k(\mathbf{y}_t)$. In addition, the above-mentioned object identification module is capable of automatically detecting abnormal events (e.g., occlusion). When $g(\mathbf{x}_t^k; \mathbf{M}, \mathbf{P}_k, \mathbf{y}_t)$ is very large, the tracked objects often have drastic appearance changes (caused by occlusion, noisy corruption, shape deformation, and so on). Therefore, abnormal changes in object appearance can be automatically detected by checking the value of $g(\mathbf{x}_t^k; \mathbf{M}, \mathbf{P}_k, \mathbf{y}_t)$.
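The cumulative decision rule of Equ. (19) can be illustrated with a minimal sketch. It assumes the per-frame reconstruction errors and classification scores have already been computed by the tracker; the function name `cumulative_identification`, the array layout, and the `abnormal_threshold` parameter are hypothetical choices made for illustration, and taking the smallest error over classes as the abnormality test is only one plausible reading of the text.

```python
import numpy as np

def cumulative_identification(recon_errors, scores, abnormal_threshold=None):
    """Accumulate score-weighted reconstruction errors per class (cf. Equ. (19)).

    recon_errors: array of shape (t, K); entry [i, k] is the reconstruction
                  error of frame i against the static templates of class k.
    scores:       array of shape (t,); classification score of each frame,
                  used directly as the weight omega (assumption: omega = S).
    """
    errors = np.asarray(recon_errors, dtype=float)
    weights = np.asarray(scores, dtype=float)
    # Running sum over frames gives the cumulative distance for every t.
    cumulative = np.cumsum(weights[:, None] * errors, axis=0)
    labels = cumulative.argmin(axis=1)          # k* for each frame t
    abnormal = None
    if abnormal_threshold is not None:
        # Hypothetical test: even the best-matching class reconstructs poorly.
        abnormal = errors.min(axis=1) > abnormal_threshold
    return cumulative, labels, abnormal

errors = [[0.2, 0.9], [0.8, 0.3], [0.9, 0.2]]
cumulative, labels, abnormal = cumulative_identification(
    errors, scores=[1.0, 1.0, 1.0], abnormal_threshold=0.25)
```

Because the errors are accumulated, a single ambiguous frame (the second frame in this toy input) does not immediately flip the identity; the label changes only once the evidence for the other class dominates.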

Based on Equ. (19), we carry out the pedestrian identification task on video sequences with two viewpoints (see footnote 1). Prior to pedestrian tracking and identification, we collect a set of static templates from some training video sequences. In total, there are six individual pedestrians corresponding to six object classes. Fig. 16 shows the pedestrian tracking and identification results as well as the static templates used in tracking. For a clear illustration, Fig. 17 gives an intuitive example of the whole pedestrian identification process. From Fig. 17, we observe that our method is able to accurately recognize the tracked pedestrian's identity throughout the entire video sequence. More pedestrian identification results can be found in the supplementary material.

Moreover, we apply our method to detect whether abnormal events (e.g., occlusion) take place in the tracking process. Namely, if the target appearance is weakly correlated with the static templates (i.e., the reconstruction errors are high), an abnormal event is likely to be occurring during tracking. Fig. 18 shows an example of detecting frame-by-frame occlusion events on the "girl" video sequence. Fig. 18 shows that our method succeeds in detecting the occlusion events in most cases.


1 http://homepages.inf.ed.ac.uk/rbf/CAVIARDATA1/

Fig. 16: Two-view pedestrian identification examples. It is clear that the pedestrians can be accurately identified.



Fig. 17: Illustration of our pedestrian identification method. The first column shows the tracking results on the video frames at two different 
viewpoints; the second column displays the frame-by-frame reconstruction errors based on frame-independent metric-weighted linear regression; 
the third column exhibits the frame-by-frame identification results associated with the second column; the fourth column plots the frame-by-frame 
cumulative reconstruction errors based on frame-dependent metric-weighted linear regression; and the last column corresponds to the frame-by-frame 
identification results associated with the fourth column. Clearly, our cumulative classification method is able to correctly identify the same pedestrian 
from two different viewpoints. 


Fig. 18: Example of detecting the occlusion events. The left part shows a sample frame and the static templates, and the right part plots the frame-by-frame occlusion detection results during tracking.

VI. Conclusion

In this work, we have proposed an online metric-weighted linear representation for robust visual tracking. With a closed-form analytical solution, the proposed linear representation is capable of effectively encoding the discriminative information on object/non-object classification. We designed an online Mahalanobis distance metric learning scheme, including online non-structured and structured metric learning. The metric learning scheme aims to distinguish the relative importance of individual feature dimensions and capture the correlation between feature dimensions in a feasible metric space. We empirically show that adding a metric to the linear representation considerably improves the robustness of the tracker. To make the online metric learning even more efficient, we design, for the first time, a learning mechanism based on time-weighted reservoir sampling. With this mechanism, recently streamed samples in the video are assigned higher weights. We have also theoretically proved that metric learning based on the proposed reservoir sampling with limited-sized sampling buffers can effectively approximate metric learning using all the received training samples. Compared with state-of-the-art trackers on eighteen challenging sequences, we empirically show that our method is more robust to complicated appearance changes, pose variations, and occlusions. Furthermore, we extend our work to perform pedestrian identification and occlusion event detection during object tracking. Experimental results demonstrate the effectiveness of our work.

To balance efficiency and effectiveness, non-structured and structured metric learning methods can be applied alternately during tracking. For example, the non-structured method can produce metric learning results with a high update frequency, and the structured method can then refine these results with a low update frequency. We plan to investigate this in future work.

Supplementary

In this supplementary material, we provide more technical details on the online update of the metric-weighted linear representation and on online discriminative distance metric learning, together with a theoretical analysis of time-weighted reservoir sampling. Furthermore, we show more experimental results (both qualitative and quantitative), including experimental demonstration videos, more CLE (center location error) and VOR (VOC overlap ratio) curves for different evaluation tasks, intuitive frame tracking images, and more frame-by-frame pedestrian identification results.


| Video sequence | Corresponding video file |
|---|---|
| B-Beam | Video 01 BalanceBeam.mp4 |
| Lola | Video 02 Lola.mp4 |
| trace | Video 03 trace.mp4 |
| Walk | Video 04 Walk.mp4 |
| football | Video 05 football.mp4 |
| iceball | Video 06 iceball.mp4 |
| coke11 | Video 07 coke11.mp4 |
| trellis70 | Video 08 trellis70.mp4 |
| dograce | Video 09 dograce.mp4 |
| football3 | Video 10 football3.mp4 |
| cubicle | Video 11 cubicle.mp4 |
| seq-jd | Video 12 seq-jd.mp4 |
| girl | Video 13 girl.mp4 |
| BMX-Street | Video 14 BMX-Street.mp4 |
| planeshow | Video 15 planeshow.mp4 |
| race | Video 16 race.mp4 |
| CamSeq01 | Video 17 CamSeq01.mp4 |
| car11 | Video 18 car11.mp4 |

TABLE VI: The configurations of the eighteen experimental demonstration videos. These demonstration videos can be downloaded at the following link: http://cs.adelaide.edu.au/users/xi/pamimetricdemo.zip

VII. Online proximity based metric learning

Having introduced the metric-weighted linear representation in Sec. III-B, we now address the key issue of calculating 
the metric matrix M. M should ideally be learned from the visual data, and should be dynamically updated as conditions 
change throughout a video sequence. 

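1) Triplet-based ranking losses: Suppose that we have a set of sample triplets $\{(\mathbf{p}, \mathbf{p}^{+}, \mathbf{p}^{-})\}$ with $\mathbf{p}, \mathbf{p}^{+}, \mathbf{p}^{-} \in \mathbb{R}^{d}$. These triplets encode the proximity comparison information. In each triplet, the distance between $\mathbf{p}$ and $\mathbf{p}^{+}$ should be smaller than the distance between $\mathbf{p}$ and $\mathbf{p}^{-}$.

The Mahalanobis distance under metric $\mathbf{M}$ is defined as:

$$D_{\mathbf{M}}(\mathbf{p}, \mathbf{q}) = (\mathbf{p} - \mathbf{q})^{\top} \mathbf{M} (\mathbf{p} - \mathbf{q}). \tag{20}$$

Clearly, $\mathbf{M}$ must be a symmetric and positive semidefinite matrix. Learning $\mathbf{M}$ is equivalent to learning a projection matrix $\mathbf{L}$ such that $\mathbf{M} = \mathbf{L}\mathbf{L}^{\top}$. In practice, we generate the triplet set such that $\mathbf{p}$ and $\mathbf{p}^{+}$ belong to the same class while $\mathbf{p}$ and $\mathbf{p}^{-}$ belong to different classes. We therefore want the constraints $D_{\mathbf{M}}(\mathbf{p}, \mathbf{p}^{+}) < D_{\mathbf{M}}(\mathbf{p}, \mathbf{p}^{-})$ to be satisfied as well as possible. Putting this into a large-margin learning framework and using the soft-margin hinge loss, the loss function for each triplet is:

$$l_{\mathbf{M}}(\mathbf{p}, \mathbf{p}^{+}, \mathbf{p}^{-}) = \max\{0,\; 1 + D_{\mathbf{M}}(\mathbf{p}, \mathbf{p}^{+}) - D_{\mathbf{M}}(\mathbf{p}, \mathbf{p}^{-})\}. \tag{21}$$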
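As a concrete illustration of Equs. (20) and (21), the following minimal sketch evaluates the distance and the triplet hinge loss with NumPy. The function names `mahalanobis_sq` and `triplet_hinge_loss` are hypothetical and chosen only for this example.

```python
import numpy as np

def mahalanobis_sq(M, p, q):
    """D_M(p, q) = (p - q)^T M (p - q), as in Equ. (20)."""
    d = np.asarray(p, dtype=float) - np.asarray(q, dtype=float)
    return float(d @ M @ d)

def triplet_hinge_loss(M, p, p_pos, p_neg):
    """Soft-margin triplet loss of Equ. (21) with unit margin."""
    return max(0.0, 1.0 + mahalanobis_sq(M, p, p_pos) - mahalanobis_sq(M, p, p_neg))

M = np.eye(2)                      # identity metric, i.e. squared Euclidean distance
p, near, far = [0.0, 0.0], [1.0, 0.0], [3.0, 0.0]
loss_satisfied = triplet_hinge_loss(M, p, near, far)   # 1 + 1 - 9 < 0, so no loss
loss_violated = triplet_hinge_loss(M, p, far, near)    # 1 + 9 - 1 = 9
```

The loss is zero only when the negative sample is farther than the positive sample by at least the unit margin, which is the condition the passive-aggressive update later tests before changing the metric.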

2) Large-margin metric learning: To obtain the optimal distance metric matrix $\mathbf{M}$, we need to minimize the global loss $L_{\mathbf{M}}$ that sums the hinge losses (21) over all possible triplets from the training set:

$$L_{\mathbf{M}} = \sum_{(\mathbf{p}, \mathbf{p}^{+}, \mathbf{p}^{-}) \in Q} l_{\mathbf{M}}(\mathbf{p}, \mathbf{p}^{+}, \mathbf{p}^{-}), \tag{22}$$

where $Q$ is the triplet set. To sequentially optimize the above objective function $L_{\mathbf{M}}$ in an online fashion, we design an iterative algorithm to solve the following convex problem:

$$\mathbf{M}^{k+1} = \arg\min_{\mathbf{M}} \tfrac{1}{2}\|\mathbf{M} - \mathbf{M}^{k}\|_F^2 + C\xi, \quad \text{s.t.}\ D_{\mathbf{M}}(\mathbf{p}, \mathbf{p}^{-}) - D_{\mathbf{M}}(\mathbf{p}, \mathbf{p}^{+}) \ge 1 - \xi,\ \xi \ge 0, \tag{23}$$

where $\|\cdot\|_F$ denotes the Frobenius norm, $\xi$ is a slack variable, and $C$ is a positive factor controlling the trade-off between the smoothness term $\tfrac{1}{2}\|\mathbf{M} - \mathbf{M}^{k}\|_F^2$ and the loss term $\xi$. Following the passive-aggressive mechanism used in [22], [43], we only update the metric matrix $\mathbf{M}$ when $l_{\mathbf{M}}(\mathbf{p}, \mathbf{p}^{+}, \mathbf{p}^{-}) > 0$.

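3) Optimization of M: We optimize the problem in Equ. (23) through its Lagrangian:

$$\mathcal{L}(\mathbf{M}, \eta, \xi, \beta) = \tfrac{1}{2}\|\mathbf{M} - \mathbf{M}^{k}\|_F^2 + C\xi - \beta\xi + \eta\,(1 - \xi + D_{\mathbf{M}}(\mathbf{p}, \mathbf{p}^{+}) - D_{\mathbf{M}}(\mathbf{p}, \mathbf{p}^{-})), \tag{24}$$

where $\eta \ge 0$ and $\beta \ge 0$ are Lagrange multipliers. The optimization procedure is carried out in the following two alternating steps.

• Update M. By setting $\partial\mathcal{L}/\partial\mathbf{M} = 0$, we arrive at the update rule

$$\mathbf{M}^{k+1} = \mathbf{M}^{k} + \eta\mathbf{U}, \tag{25}$$

where $\mathbf{U} = \mathbf{a}_{-}\mathbf{a}_{-}^{\top} - \mathbf{a}_{+}\mathbf{a}_{+}^{\top}$, $\mathbf{a}_{+} = \mathbf{p} - \mathbf{p}^{+}$, and $\mathbf{a}_{-} = \mathbf{p} - \mathbf{p}^{-}$.

By taking the derivative of $\mathcal{L}(\mathbf{M}, \eta, \xi, \beta)$ w.r.t. $\mathbf{M}$, we have the following:

$$\frac{\partial\mathcal{L}(\mathbf{M}, \eta, \xi, \beta)}{\partial\mathbf{M}} = \mathbf{M} - \mathbf{M}^{k} + \eta\,\frac{\partial[D_{\mathbf{M}}(\mathbf{p}, \mathbf{p}^{+}) - D_{\mathbf{M}}(\mathbf{p}, \mathbf{p}^{-})]}{\partial\mathbf{M}}. \tag{26}$$

Mathematically, the derivative of the distance difference can be formulated as:

$$\frac{\partial[D_{\mathbf{M}}(\mathbf{p}, \mathbf{p}^{+}) - D_{\mathbf{M}}(\mathbf{p}, \mathbf{p}^{-})]}{\partial\mathbf{M}} = \mathbf{a}_{+}\mathbf{a}_{+}^{\top} - \mathbf{a}_{-}\mathbf{a}_{-}^{\top}, \tag{27}$$

where $\mathbf{a}_{+} = \mathbf{p} - \mathbf{p}^{+}$ and $\mathbf{a}_{-} = \mathbf{p} - \mathbf{p}^{-}$. The optimal $\mathbf{M}^{k+1}$ is obtained by setting $\partial\mathcal{L}/\partial\mathbf{M}$ to zero. As a result, the following relation holds:

$$\mathbf{M}^{k+1} = \mathbf{M}^{k} + \eta\,(\mathbf{a}_{-}\mathbf{a}_{-}^{\top} - \mathbf{a}_{+}\mathbf{a}_{+}^{\top}). \tag{28}$$

• Update $\eta$. Subsequently, we take the derivative of the Lagrangian (24) w.r.t. $\xi$ and set it to zero, leading to the update rule:

$$\frac{\partial\mathcal{L}(\mathbf{M}, \eta, \xi, \beta)}{\partial\xi} = C - \beta - \eta = 0. \tag{29}$$

Clearly, $\beta \ge 0$ implies $\eta \le C$. For notational simplicity, $\mathbf{a}_{-}\mathbf{a}_{-}^{\top} - \mathbf{a}_{+}\mathbf{a}_{+}^{\top}$ is abbreviated as $\mathbf{U}$ hereinafter. By substituting Equs. (28) and (29) into Equ. (24) with $\mathbf{M} = \mathbf{M}^{k+1}$, we have:

$$\mathcal{L}(\eta) = \tfrac{1}{2}\|\eta\mathbf{U}\|_F^2 + \eta\,(1 + D_{\mathbf{M}^{k+1}}(\mathbf{p}, \mathbf{p}^{+}) - D_{\mathbf{M}^{k+1}}(\mathbf{p}, \mathbf{p}^{-})), \tag{30}$$

where $D_{\mathbf{M}^{k+1}}(\mathbf{p}, \mathbf{p}^{+}) = \mathbf{a}_{+}^{\top}(\mathbf{M}^{k} + \eta\mathbf{U})\mathbf{a}_{+}$ and $D_{\mathbf{M}^{k+1}}(\mathbf{p}, \mathbf{p}^{-}) = \mathbf{a}_{-}^{\top}(\mathbf{M}^{k} + \eta\mathbf{U})\mathbf{a}_{-}$. As a result, $\mathcal{L}(\eta)$ can be reformulated as:

$$\mathcal{L}(\eta) = \lambda_2\eta^2 + \lambda_1\eta + \lambda_0, \tag{31}$$

where $\lambda_2 = \tfrac{1}{2}\|\mathbf{U}\|_F^2 + \mathbf{a}_{+}^{\top}\mathbf{U}\mathbf{a}_{+} - \mathbf{a}_{-}^{\top}\mathbf{U}\mathbf{a}_{-}$, $\lambda_1 = 1 + \mathbf{a}_{+}^{\top}\mathbf{M}^{k}\mathbf{a}_{+} - \mathbf{a}_{-}^{\top}\mathbf{M}^{k}\mathbf{a}_{-}$, and $\lambda_0 = 0$. To obtain the optimal $\eta$, we differentiate $\mathcal{L}(\eta)$ w.r.t. $\eta$ and set it to zero:

$$\frac{\partial\mathcal{L}(\eta)}{\partial\eta} = \eta\,(\|\mathbf{U}\|_F^2 + 2\mathbf{a}_{+}^{\top}\mathbf{U}\mathbf{a}_{+} - 2\mathbf{a}_{-}^{\top}\mathbf{U}\mathbf{a}_{-}) + (1 + \mathbf{a}_{+}^{\top}\mathbf{M}^{k}\mathbf{a}_{+} - \mathbf{a}_{-}^{\top}\mathbf{M}^{k}\mathbf{a}_{-}) = 0. \tag{32}$$

As a result, the following relation holds:

$$\eta = -\frac{1 + \mathbf{a}_{+}^{\top}\mathbf{M}^{k}\mathbf{a}_{+} - \mathbf{a}_{-}^{\top}\mathbf{M}^{k}\mathbf{a}_{-}}{\|\mathbf{U}\|_F^2 + 2\mathbf{a}_{+}^{\top}\mathbf{U}\mathbf{a}_{+} - 2\mathbf{a}_{-}^{\top}\mathbf{U}\mathbf{a}_{-}}. \tag{33}$$

Due to the constraint $0 \le \eta \le C$, $\eta$ should take the following value:

$$\eta = \min\left\{C,\ \max\left\{0,\ \frac{1 + \mathbf{a}_{+}^{\top}\mathbf{M}^{k}\mathbf{a}_{+} - \mathbf{a}_{-}^{\top}\mathbf{M}^{k}\mathbf{a}_{-}}{2\mathbf{a}_{-}^{\top}\mathbf{U}\mathbf{a}_{-} - 2\mathbf{a}_{+}^{\top}\mathbf{U}\mathbf{a}_{+} - \|\mathbf{U}\|_F^2}\right\}\right\}. \tag{34}$$

The full derivation of each step can be found in the supplementary file. The complete procedure of online distance metric learning is summarized in Algorithm 2.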
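The single-triplet update of Equs. (28) and (34) can be sketched as follows. This is an illustrative sketch rather than the authors' Algorithm 2: the function name `pa_metric_update` and the default `C` are assumptions, and no projection onto the positive semidefinite cone is performed. As a side observation (not stated in the text), since $\mathbf{a}_{+}^{\top}\mathbf{U}\mathbf{a}_{+} - \mathbf{a}_{-}^{\top}\mathbf{U}\mathbf{a}_{-} = -\|\mathbf{U}\|_F^2$ for symmetric $\mathbf{U}$, the denominator in Equ. (34) appears to reduce to $\|\mathbf{U}\|_F^2$; the code evaluates the denominator as written in Equ. (34) so that this can be checked numerically.

```python
import numpy as np

def pa_metric_update(M, p, p_pos, p_neg, C=1.0):
    """One passive-aggressive metric update for a single triplet.

    Returns the updated metric and the step size eta (0 if no update is made).
    """
    p, p_pos, p_neg = (np.asarray(v, dtype=float) for v in (p, p_pos, p_neg))
    a_pos, a_neg = p - p_pos, p - p_neg
    # Hinge-loss argument of Equ. (21) under the current metric M^k.
    loss = 1.0 + a_pos @ M @ a_pos - a_neg @ M @ a_neg
    if loss <= 0.0:
        return M, 0.0                      # passive step: constraint already met
    U = np.outer(a_neg, a_neg) - np.outer(a_pos, a_pos)
    # Denominator exactly as written in Equ. (34).
    denom = 2.0 * (a_neg @ U @ a_neg) - 2.0 * (a_pos @ U @ a_pos) - np.sum(U * U)
    if denom <= 0.0:
        return M, 0.0                      # degenerate triplet (U = 0)
    eta = min(C, max(0.0, loss / denom))   # clipping of Equ. (34)
    return M + eta * U, eta                # rank-two update of Equ. (28)

M0 = np.eye(2)
p, p_pos, p_neg = [0.0, 0.0], [2.0, 0.0], [1.0, 0.0]   # positive is farther: violated
M1, eta = pa_metric_update(M0, p, p_pos, p_neg, C=1.0)
a_pos, a_neg = np.array(p) - np.array(p_pos), np.array(p) - np.array(p_neg)
new_loss = 1.0 + a_pos @ M1 @ a_pos - a_neg @ M1 @ a_neg
```

In this toy case the unclipped step drives the hinge-loss argument to zero, but the resulting matrix is no longer positive semidefinite, which is why the choice between no, step-by-step, and final eigendecomposition matters in practice.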

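4) Online update: When updated according to Algorithm 2, $\mathbf{M}$ is modified by rank-one additions such that $\mathbf{M} \leftarrow \mathbf{M} + \eta\,(\mathbf{a}_{-}\mathbf{a}_{-}^{\top} - \mathbf{a}_{+}\mathbf{a}_{+}^{\top})$, where $\mathbf{a}_{+} = \mathbf{p} - \mathbf{p}^{+}$ and $\mathbf{a}_{-} = \mathbf{p} - \mathbf{p}^{-}$ are two vectors (defined in Equ. (28)) for triplet construction, and $\eta$ is a step-size factor (defined in Equ. (34)). As a result, the original $\mathbf{P}^{\top}\mathbf{M}\mathbf{P}$ becomes $\mathbf{P}^{\top}\mathbf{M}\mathbf{P} + (\eta\mathbf{P}^{\top}\mathbf{a}_{-})(\mathbf{P}^{\top}\mathbf{a}_{-})^{\top} + (-\eta\mathbf{P}^{\top}\mathbf{a}_{+})(\mathbf{P}^{\top}\mathbf{a}_{+})^{\top}$. When $\mathbf{M}$ is modified by a rank-one addition, the inverse of $\mathbf{P}^{\top}\mathbf{M}\mathbf{P}$ can be updated according to the theory of [44], [45]:

$$(\mathbf{J} + \mathbf{u}\mathbf{v}^{\top})^{-1} = \mathbf{J}^{-1} - \frac{\mathbf{J}^{-1}\mathbf{u}\mathbf{v}^{\top}\mathbf{J}^{-1}}{1 + \mathbf{v}^{\top}\mathbf{J}^{-1}\mathbf{u}}. \tag{35}$$

Here, $\mathbf{J} = \mathbf{P}^{\top}\mathbf{M}\mathbf{P}$, $\mathbf{u} = \eta\mathbf{P}^{\top}\mathbf{a}_{-}$ (or $\mathbf{u} = -\eta\mathbf{P}^{\top}\mathbf{a}_{+}$), and $\mathbf{v} = \mathbf{P}^{\top}\mathbf{a}_{-}$ (or $\mathbf{v} = \mathbf{P}^{\top}\mathbf{a}_{+}$).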
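The rank-one inverse update can be checked numerically with a short sketch. The helper name `sherman_morrison`, the random test matrices, and the step size are illustrative assumptions; the sketch applies the rank-one formula twice (once for the $\mathbf{a}_{-}$ term and once for the $\mathbf{a}_{+}$ term) and compares the result with a direct inversion.

```python
import numpy as np

def sherman_morrison(J_inv, u, v):
    """Return (J + u v^T)^{-1} given J^{-1}, via the rank-one update formula."""
    Ju = J_inv @ u
    vJ = v @ J_inv
    return J_inv - np.outer(Ju, vJ) / (1.0 + v @ Ju)

rng = np.random.default_rng(0)
P = rng.standard_normal((6, 3))        # hypothetical sample basis (d = 6, n = 3)
M = np.eye(6)                          # current metric
a_neg = rng.standard_normal(6)
a_pos = rng.standard_normal(6)
eta = 0.01                             # small illustrative step size

J_inv = np.linalg.inv(P.T @ M @ P)
# Two successive rank-one corrections, one per outer-product term.
J_inv = sherman_morrison(J_inv, eta * (P.T @ a_neg), P.T @ a_neg)
J_inv = sherman_morrison(J_inv, -eta * (P.T @ a_pos), P.T @ a_pos)

M_new = M + eta * (np.outer(a_neg, a_neg) - np.outer(a_pos, a_pos))
direct = np.linalg.inv(P.T @ M_new @ P)
```

Each correction costs on the order of n squared operations for an n-by-n matrix, instead of the roughly cubic cost of recomputing the inverse from scratch after every metric update.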
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VIII. Online structured metric learning 

Metric learning based on sample proximity comparisons leads to an efficient online learning algorithm, but requires 
pre-defined sets of positive and negative samples. In tracking, these usually correspond to target/non-target image patches. 
The boundary between these classes typically occurs where sample overlap with the target drops below a threshold, but 
this can be difficult to evaluate exactly and thus introduces some noise into the algorithm. 

In this Section, we replace the proximity based metric learning module with an online structured metric learning 
method for learning M. The main advantage of this method is that it directly learns the metric from measured sample 
overlap, and therefore does not require the separation of samples into positive and negative classes. 

Structured ranking. Let $\mathbf{p}_t$ and $\mathbf{p}_t'$ denote two feature vectors extracted from two image patches, which are respectively associated with two bounding boxes $\mathbf{R}_t$ and $\mathbf{R}_t'$ from frame $t$. Without loss of generality, let us assume that $\mathbf{R}_t$ corresponds to the bounding box obtained by the current tracker while $\mathbf{R}_t'$ is associated with a bounding box from the area surrounding $\mathbf{R}_t$. As in [13], the structural affinity relationship between $\mathbf{p}_t$ and $\mathbf{p}_t'$ is captured by the following overlap function: $s_t^{o}(\mathbf{p}_t \circ \mathbf{R}_t, \mathbf{p}_t' \circ \mathbf{R}_t') = \frac{|\mathbf{R}_t \cap \mathbf{R}_t'|}{|\mathbf{R}_t \cup \mathbf{R}_t'|}$. As a result, we define the following optimization problem for structured metric learning:

$$\mathbf{M}^{k+1} = \arg\min_{\mathbf{M}} \tfrac{1}{2}\|\mathbf{M} - \mathbf{M}^{k}\|_F^2 + C\xi, \quad \text{s.t.}\ D_{\mathbf{M}}(\mathbf{p}_t, \mathbf{p}_t^{j}) - D_{\mathbf{M}}(\mathbf{p}_t, \mathbf{p}_t^{i}) \ge \Delta_{ij} - \xi,\ \forall i, j, \tag{36}$$

where $\xi \ge 0$ and $\Delta_{ij} = s_t^{o}(\mathbf{p}_t \circ \mathbf{R}_t, \mathbf{p}_t^{i} \circ \mathbf{R}_t^{i}) - s_t^{o}(\mathbf{p}_t \circ \mathbf{R}_t, \mathbf{p}_t^{j} \circ \mathbf{R}_t^{j})$, with $\mathbf{p}_t^{i}$ and $\mathbf{p}_t^{j}$ denoting the feature vectors of two patches $\mathbf{R}_t^{i}$ and $\mathbf{R}_t^{j}$ around $\mathbf{R}_t$. Clearly, the number of constraints in the optimization problem (36) is exponentially large or even infinite, making it difficult to optimize. Our approach to this optimization problem differs from [13] in four main aspects: i) our approach aims to learn a distance metric while [13] seeks an SVM classifier; ii) we optimize an online max-margin objective function while [13] solves a batch-mode optimization problem; iii) our optimization problem involves nonlinear constraints on triplet-based Mahalanobis distance differences, while the optimization problem in [13] comprises linear constraints on doublet-based SVM classification score differences; and iv) our approach directly solves the primal optimization problem while [13] optimizes the dual problem.
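A minimal sketch of the overlap function and the resulting margin may help. It assumes the overlap is the usual intersection-over-union ratio of two axis-aligned boxes, as in Struck [13], and that boxes are given as `(x, y, width, height)` tuples; the names `overlap` and `margin` are hypothetical.

```python
def overlap(box_a, box_b):
    """Intersection-over-union of two boxes given as (x, y, width, height)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def margin(tracked_box, box_i, box_j):
    """Required distance margin: overlap of box i minus overlap of box j."""
    return overlap(tracked_box, box_i) - overlap(tracked_box, box_j)

tracked = (0.0, 0.0, 2.0, 2.0)
shifted = (1.0, 0.0, 2.0, 2.0)           # shares half of the tracked box
delta = margin(tracked, tracked, shifted)
```

A patch that overlaps the tracked box more than another patch is required to be closer to it in the learned metric by at least this margin, so no hard positive/negative threshold is needed.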

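Structured optimization. Inspired by the cutting-plane method, we iteratively construct a constraint set (denoted as $\mathcal{V}$) containing the most violated constraints for the optimization problem (36). In our case, the most violated constraint is selected according to the following criterion:

$$(\mu, \nu) = \arg\max_{(i,j)}\ \Delta_{ij} + D_{\mathbf{M}}(\mathbf{p}_t, \mathbf{p}_t^{i}) - D_{\mathbf{M}}(\mathbf{p}_t, \mathbf{p}_t^{j}). \tag{37}$$

For notational simplicity, let $l_{\mathbf{M}}(\mathbf{p}_t \circ \mathbf{R}_t, \mathbf{p}_t^{\mu} \circ \mathbf{R}_t^{\mu}, \mathbf{p}_t^{\nu} \circ \mathbf{R}_t^{\nu})$ denote the loss term $\Delta_{\mu\nu} + D_{\mathbf{M}}(\mathbf{p}_t, \mathbf{p}_t^{\mu}) - D_{\mathbf{M}}(\mathbf{p}_t, \mathbf{p}_t^{\nu})$. Note that the violated constraint generated from (37) is used if and only if $l_{\mathbf{M}}(\mathbf{p}_t \circ \mathbf{R}_t, \mathbf{p}_t^{\mu} \circ \mathbf{R}_t^{\mu}, \mathbf{p}_t^{\nu} \circ \mathbf{R}_t^{\nu})$ is greater than zero. Subsequently, we add the most violated constraint to the optimization problem (36) in an iterative manner, that is, $\mathcal{V} \leftarrow \mathcal{V} \cup \{(\mathbf{p}_t^{\mu} \circ \mathbf{R}_t^{\mu}, \mathbf{p}_t^{\nu} \circ \mathbf{R}_t^{\nu})\}$. The corresponding Lagrangian is formulated as:

$$\mathcal{L} = \tfrac{1}{2}\|\mathbf{M} - \mathbf{M}^{k}\|_F^2 + (C - \beta)\xi + \sum_{\ell=1}^{|\mathcal{V}|} \eta_\ell\,[\Delta_{\mu_\ell\nu_\ell} - \xi + D_{\mathbf{M}}(\mathbf{p}_t, \mathbf{p}_t^{\mu_\ell}) - D_{\mathbf{M}}(\mathbf{p}_t, \mathbf{p}_t^{\nu_\ell})], \tag{38}$$

where $\beta \ge 0$ and $\eta_\ell \ge 0$ are Lagrange multipliers, and $(\mu_\ell, \nu_\ell)$ indexes the $\ell$-th constraint in $\mathcal{V}$. The optimization procedure is once again carried out in two alternating steps:

• Update M. By setting $\partial\mathcal{L}/\partial\mathbf{M}$ to zero, we obtain an updated $\mathbf{M}$ defined as:

$$\mathbf{M}^{k+1} = \mathbf{M}^{k} + \sum_{\ell=1}^{|\mathcal{V}|} \eta_\ell\mathbf{U}_\ell, \tag{39}$$

where $\mathbf{U}_\ell = \mathbf{a}_t^{\nu_\ell}(\mathbf{a}_t^{\nu_\ell})^{\top} - \mathbf{a}_t^{\mu_\ell}(\mathbf{a}_t^{\mu_\ell})^{\top}$, and $\mathbf{a}_t^{n}$ denotes $\mathbf{p}_t - \mathbf{p}_t^{n}$.

The first-order derivative of $\mathcal{L}$ w.r.t. $\mathbf{M}$ is expressed as:

$$\frac{\partial\mathcal{L}}{\partial\mathbf{M}} = \mathbf{M} - \mathbf{M}^{k} + \sum_{\ell=1}^{|\mathcal{V}|} \eta_\ell\,\frac{\partial[D_{\mathbf{M}}(\mathbf{p}_t, \mathbf{p}_t^{\mu_\ell}) - D_{\mathbf{M}}(\mathbf{p}_t, \mathbf{p}_t^{\nu_\ell})]}{\partial\mathbf{M}}. \tag{40}$$

Clearly, $\partial D_{\mathbf{M}}(\mathbf{p}_t, \mathbf{p}_t^{n})/\partial\mathbf{M}$ is equal to $(\mathbf{p}_t - \mathbf{p}_t^{n})(\mathbf{p}_t - \mathbf{p}_t^{n})^{\top}$. Letting $\mathbf{a}_t^{n}$ denote $\mathbf{p}_t - \mathbf{p}_t^{n}$, we rewrite (40) as $\frac{\partial\mathcal{L}}{\partial\mathbf{M}} = \mathbf{M} - \mathbf{M}^{k} + \sum_{\ell=1}^{|\mathcal{V}|} \eta_\ell\,[\mathbf{a}_t^{\mu_\ell}(\mathbf{a}_t^{\mu_\ell})^{\top} - \mathbf{a}_t^{\nu_\ell}(\mathbf{a}_t^{\nu_\ell})^{\top}]$. By setting $\partial\mathcal{L}/\partial\mathbf{M}$ to zero, we obtain the optimal $\mathbf{M}$ defined as $\mathbf{M}^{k+1} = \mathbf{M}^{k} + \sum_{\ell=1}^{|\mathcal{V}|} \eta_\ell\,[\mathbf{a}_t^{\nu_\ell}(\mathbf{a}_t^{\nu_\ell})^{\top} - \mathbf{a}_t^{\mu_\ell}(\mathbf{a}_t^{\mu_\ell})^{\top}]$.

• Update $\eta_\ell$. To obtain the optimal solution for all Lagrange multipliers $\eta_\ell$, we take the first-order derivative of $\mathcal{L}$ w.r.t. $\eta_\ell$ and set it to zero:

$$\begin{aligned}\frac{\partial\mathcal{L}}{\partial\eta_\ell} ={}& \eta_\ell\,[\mathbf{1}^{\top}(\mathbf{U}_\ell \circ \mathbf{U}_\ell)\mathbf{1} + 2(\mathbf{a}_t^{\mu_\ell})^{\top}\mathbf{U}_\ell\mathbf{a}_t^{\mu_\ell} - 2(\mathbf{a}_t^{\nu_\ell})^{\top}\mathbf{U}_\ell\mathbf{a}_t^{\nu_\ell}] + [\Delta_{\mu_\ell\nu_\ell} + (\mathbf{a}_t^{\mu_\ell})^{\top}\mathbf{M}^{k}\mathbf{a}_t^{\mu_\ell} - (\mathbf{a}_t^{\nu_\ell})^{\top}\mathbf{M}^{k}\mathbf{a}_t^{\nu_\ell}] \\ &+ \sum_{m \ne \ell} \eta_m\,[\mathbf{1}^{\top}(\mathbf{U}_\ell \circ \mathbf{U}_m)\mathbf{1} + (\mathbf{a}_t^{\mu_\ell})^{\top}\mathbf{U}_m\mathbf{a}_t^{\mu_\ell} - (\mathbf{a}_t^{\nu_\ell})^{\top}\mathbf{U}_m\mathbf{a}_t^{\nu_\ell} + (\mathbf{a}_t^{\mu_m})^{\top}\mathbf{U}_\ell\mathbf{a}_t^{\mu_m} - (\mathbf{a}_t^{\nu_m})^{\top}\mathbf{U}_\ell\mathbf{a}_t^{\nu_m}] = 0,\end{aligned} \tag{41}$$

where $\mathbf{1}$ is the all-one column vector and $\circ$ is the elementwise product operator. Hence, we have a linear equation $\mathbf{B}\boldsymbol{\eta} = \mathbf{f}$, where $\boldsymbol{\eta} = (\eta_1, \eta_2, \ldots, \eta_{|\mathcal{V}|})^{\top}$, $\mathbf{f} = (f_1, f_2, \ldots, f_{|\mathcal{V}|})^{\top}$ with $f_\ell$ being $-[\Delta_{\mu_\ell\nu_\ell} + (\mathbf{a}_t^{\mu_\ell})^{\top}\mathbf{M}^{k}\mathbf{a}_t^{\mu_\ell} - (\mathbf{a}_t^{\nu_\ell})^{\top}\mathbf{M}^{k}\mathbf{a}_t^{\nu_\ell}]$, and $\mathbf{B} = (b_{\ell m})_{|\mathcal{V}| \times |\mathcal{V}|}$ with $b_{\ell m}$ being $\mathbf{1}^{\top}(\mathbf{U}_\ell \circ \mathbf{U}_m)\mathbf{1} + (\mathbf{a}_t^{\mu_\ell})^{\top}\mathbf{U}_m\mathbf{a}_t^{\mu_\ell} - (\mathbf{a}_t^{\nu_\ell})^{\top}\mathbf{U}_m\mathbf{a}_t^{\nu_\ell} + (\mathbf{a}_t^{\mu_m})^{\top}\mathbf{U}_\ell\mathbf{a}_t^{\mu_m} - (\mathbf{a}_t^{\nu_m})^{\top}\mathbf{U}_\ell\mathbf{a}_t^{\nu_m}$.

Differentiating $\mathcal{L}$ w.r.t. $\xi$ and setting it to zero, we have $C - \beta - \sum_{\ell=1}^{|\mathcal{V}|}\eta_\ell = 0$. Since $\beta \ge 0$, the relation $0 \le \sum_{\ell=1}^{|\mathcal{V}|}\eta_\ell \le C$ holds. Therefore, the optimal $\boldsymbol{\eta}^{*}$ is efficiently obtained by solving the following optimization problem:

$$\boldsymbol{\eta}^{*} = \arg\min_{\boldsymbol{\eta}} \|\mathbf{B}\boldsymbol{\eta} - \mathbf{f}\|_1, \quad \text{s.t.}\ \boldsymbol{\eta} \succeq 0;\ \mathbf{1}^{\top}\boldsymbol{\eta} \le C. \tag{42}$$

As before, the optimal $\mathbf{M}$ is updated as a sequence of rank-one additions: $\mathbf{M} \leftarrow \mathbf{M} + \eta_\ell\,[\mathbf{a}_t^{\nu_\ell}(\mathbf{a}_t^{\nu_\ell})^{\top} - \mathbf{a}_t^{\mu_\ell}(\mathbf{a}_t^{\mu_\ell})^{\top}]$. As a result, the original $\mathbf{P}^{\top}\mathbf{M}\mathbf{P}$ becomes $\mathbf{P}^{\top}\mathbf{M}\mathbf{P} + (\eta_\ell\mathbf{P}^{\top}\mathbf{a}_t^{\nu_\ell})(\mathbf{P}^{\top}\mathbf{a}_t^{\nu_\ell})^{\top} + (-\eta_\ell\mathbf{P}^{\top}\mathbf{a}_t^{\mu_\ell})(\mathbf{P}^{\top}\mathbf{a}_t^{\mu_\ell})^{\top}$. When $\mathbf{M}$ is modified by a rank-one addition, the inverse of $\mathbf{P}^{\top}\mathbf{M}\mathbf{P}$ can be easily updated according to the theory of [44], [45]. Namely, $(\mathbf{J} + \mathbf{u}\mathbf{v}^{\top})^{-1} = \mathbf{J}^{-1} - \frac{\mathbf{J}^{-1}\mathbf{u}\mathbf{v}^{\top}\mathbf{J}^{-1}}{1 + \mathbf{v}^{\top}\mathbf{J}^{-1}\mathbf{u}}$. Here, $\mathbf{J} = \mathbf{P}^{\top}\mathbf{M}\mathbf{P}$, $\mathbf{u} = \eta_\ell\mathbf{P}^{\top}\mathbf{a}_t^{\nu_\ell}$ (or $\mathbf{u} = -\eta_\ell\mathbf{P}^{\top}\mathbf{a}_t^{\mu_\ell}$), and $\mathbf{v} = \mathbf{P}^{\top}\mathbf{a}_t^{\nu_\ell}$ (or $\mathbf{v} = \mathbf{P}^{\top}\mathbf{a}_t^{\mu_\ell}$).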
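The selection of the most violated constraint in Equ. (37) can be sketched as a brute-force search over candidate pairs. The exhaustive search, the function name `most_violated_pair`, and the input layout (one candidate feature vector per row, with its overlap against the tracked box) are assumptions made for illustration; the text does not specify how the search is carried out in practice.

```python
import numpy as np

def most_violated_pair(M, p_t, candidates, overlaps):
    """Return (mu, nu, violation) maximizing Delta_ij + D_M(p_t, p_i) - D_M(p_t, p_j)."""
    p_t = np.asarray(p_t, dtype=float)
    diffs = p_t - np.asarray(candidates, dtype=float)        # rows are a^n = p_t - p^n
    dists = np.einsum("nd,de,ne->n", diffs, M, diffs)        # D_M(p_t, p^n) for every n
    overlaps = np.asarray(overlaps, dtype=float)
    # violation[i, j] = (s_i - s_j) + (D_i - D_j)
    violation = (overlaps[:, None] - overlaps[None, :]) + (dists[:, None] - dists[None, :])
    mu, nu = np.unravel_index(np.argmax(violation), violation.shape)
    return int(mu), int(nu), float(violation[mu, nu])

M = np.eye(1)
p_t = [0.0]
candidates = [[0.1], [2.0], [0.5]]     # toy one-dimensional features
overlaps = [0.9, 0.8, 0.2]             # overlap of each candidate with the tracked box
mu, nu, value = most_violated_pair(M, p_t, candidates, overlaps)
```

In the toy input, candidate 1 overlaps the target far more than candidate 2 yet lies much farther away under the current metric, so the pair (1, 2) is selected; it would be added to the constraint set only because its violation is positive.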
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IX. Theoretical analysis of time-weighted reservoir sampling 


Theorem IX.l. Given a new training sample p, we have the following relation: 


I Bc |W | E ffc + E ffc_ E E ^m(p,P + ,P“ 

1+111 \p+eB c +p-ei3 c _ 


u °+ c + 

V Wi 

h c . 


E^ L ^m(p,P- + ,P- 


<_1 E «>m + \ i_1 E Wn~ \ 

where E(-) w f/ie expectation operator, c + G {/, 6} is a class indicator variable whose class membership is the same 
as p (i.e., if p G f, c + = f; otherwise, c + = b), c_ G {/, 6} w <2 c/oss indicator variable whose class membership 
is different from p (i.e., if p G f, c_ = b; otherwise, c_ — f), {pfLIi {Pjljli denote all the received training 
sample sets before p, w{ and w h - are the corresponding weights of p{ and p^. In our case, any sample weight (or 
w h -) is defined as: = q 1 * (or = q f o) where l{ (or 1^) is the corresponding frame index number of p{ (or p b j) and 

q is a constant such that q > 1. 

Proof: In total, there are two cases for p: i) p is a foreground sample (i.e., p G /) with c+ = / and c_ = 6; and ii) p is a background 
sample (i.e., p G b) with c+ = b and c_ = /. Therefore, when p is a foreground sample, we need to prove the following relation: 


| Bf || B ,| E B/ E B6 E E Zm(p,p + ,p ) 

1 \p+eB f p-eB b 


h f w f h b w b 

e -tj*— ev-»m(p,p{,p5: 

=1 E V =1 E < 


Conversely, we need to prove the following relation: 


m\B7\ E Bb E B f E E Mp,p + ,p ) 

\p+eB bP -eB f 


wb i w i i I b f\ 

h b E XT - ^(P’Pi’Pj) 

E win \ J=1 E 

m= 1 \ n = 1 


First of all, we cope with the foreground case defined in Equ. (43). The expectation in Equ. (43) can be computed as: 


■|B f UB b \ E Bf E Bb E E «M(P,P + ,P ) 

\p+eB f p £t3 b 


wpfiBf \ e vk\ EB * [ E /M ( p ’ p+ ’ p ) 


According to the property of weighted reservoir sampling with replacement (as shown in Refs [19, 20]), we have: 


f^B b ^ E *M(P,P+P = v _E w Ji M (p,P + ,v_)] 


Here, is the probability distribution associated with {p^},^, and its corresponding probability mass function is defined as: 

, w h - 

Pr(v_ =pj) = 

Wk 3 h h 


As a result, Equ. (45) can be rewritten as: 


| Bf MBJ E Bf E B b E E MP,P + ,P ) 

1 \p+eB f p~eB b 


\B f \ 


E Bf { E E [/ M (p,p + , v_)] 


( 48 ) 
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Similar to Equ. (46), we obtain the following relation: 


T E B J E [ E [i M (p,P + ,v_)]l i 

^ P +eB/ L v -~ w i» J J 


= E W E W [ / m(p,v+,v_)] 


Here, Wy is the probability distributions corresponding to {p^ } i l 1 with the following probability mass function: 

f W I 

Pr ( v + = p {) = ~r~ • 


Therefore, 


|BfiWl Es / Eg t ^ E MP>P%> ) 

\p+eB f p-eB b 

= E w { E w [ ; m(p,v + ,v_)]|. 


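Based on Equ. (47) and Equ. (50), we reformulate $\mathbb{E}_{v_+\sim\mathcal{W}_f}\{\mathbb{E}_{v_-\sim\mathcal{W}_b}[l_M(p,v_+,v_-)]\}$ as:

$$\mathbb{E}_{v_+\sim\mathcal{W}_f}\Big\{\mathbb{E}_{v_-\sim\mathcal{W}_b}\big[l_M(p,v_+,v_-)\big]\Big\} = \mathbb{E}_{v_+\sim\mathcal{W}_f}\Big\{\sum_{j=1}^{h_b}\frac{w_j^b}{\sum_{n=1}^{h_b} w_n^b}\, l_M(p,v_+,p_j^b)\Big\} = \sum_{i=1}^{h_f}\frac{w_i^f}{\sum_{m=1}^{h_f} w_m^f}\sum_{j=1}^{h_b}\frac{w_j^b}{\sum_{n=1}^{h_b} w_n^b}\, l_M(p,p_i^f,p_j^b). \tag{52}$$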


As a result, we have the following relation: 


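$$\frac{1}{|\mathcal{B}_f||\mathcal{B}_b|}\,\mathbb{E}_{\mathcal{B}_f}\mathbb{E}_{\mathcal{B}_b}\Big[\sum_{p^+\in\mathcal{B}_f}\sum_{p^-\in\mathcal{B}_b} l_M(p,p^+,p^-)\Big] = \sum_{i=1}^{h_f}\frac{w_i^f}{\sum_{m=1}^{h_f} w_m^f}\sum_{j=1}^{h_b}\frac{w_j^b}{\sum_{n=1}^{h_b} w_n^b}\, l_M(p,p_i^f,p_j^b). \tag{53}$$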


Finally, we complete the proof of Equ. (43). Furthermore, we need to prove the background case defined in Equ. (44). After a similar process (i.e., 
from Equ. (45) to Equ. (52)), we can obtain: 


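$$\frac{1}{|\mathcal{B}_b||\mathcal{B}_f|}\,\mathbb{E}_{\mathcal{B}_b}\mathbb{E}_{\mathcal{B}_f}\Big[\sum_{p^+\in\mathcal{B}_b}\sum_{p^-\in\mathcal{B}_f} l_M(p,p^+,p^-)\Big] = \sum_{i=1}^{h_b}\frac{w_i^b}{\sum_{m=1}^{h_b} w_m^b}\sum_{j=1}^{h_f}\frac{w_j^f}{\sum_{n=1}^{h_f} w_n^f}\, l_M(p,p_i^b,p_j^f). \tag{54}$$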


As a result, we complete the proof of Equ. (44). Based on the conclusions of Equ. (53) and Equ. (54), we have: 


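$$\frac{1}{|\mathcal{B}_{c_+}||\mathcal{B}_{c_-}|}\,\mathbb{E}_{\mathcal{B}_{c_+}}\mathbb{E}_{\mathcal{B}_{c_-}}\Big[\sum_{p^+\in\mathcal{B}_{c_+}}\sum_{p^-\in\mathcal{B}_{c_-}} l_M(p,p^+,p^-)\Big] = \sum_{i=1}^{h_{c_+}}\frac{w_i^{c_+}}{\sum_{m=1}^{h_{c_+}} w_m^{c_+}}\sum_{j=1}^{h_{c_-}}\frac{w_j^{c_-}}{\sum_{n=1}^{h_{c_-}} w_n^{c_-}}\, l_M(p,p_i^{c_+},p_j^{c_-}). \tag{55}$$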


Consequently, we complete the proof of Theorem IX.1.
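
The identity established by the theorem can be checked numerically. The following is a minimal sketch rather than the authors' code: it assumes that every buffer slot is an independent draw with probability proportional to a time weight of the form q^I (the with-replacement property cited in the proof), and the function names, the loss table, and the buffer sizes are all hypothetical.

```python
import random


def time_weights(frame_indices, q=1.1):
    """Time weight q**I for each frame index I (q > 1 favours recent frames)."""
    return [q ** i for i in frame_indices]


def weighted_expectation(loss, w_f, w_b):
    """Closed form: sum over (i, j) of normalised weights times loss[i][j]."""
    z_f, z_b = sum(w_f), sum(w_b)
    return sum(
        (w_f[i] / z_f) * (w_b[j] / z_b) * loss[i][j]
        for i in range(len(w_f))
        for j in range(len(w_b))
    )


def simulated_buffer_average(loss, w_f, w_b, n_f, n_b, trials, rng):
    """Monte Carlo estimate of the expected average loss over buffer pairs.

    Assumption: each buffer is filled by independent weighted draws with
    replacement, which is a simplification of a streaming reservoir.
    """
    total = 0.0
    for _ in range(trials):
        buf_f = rng.choices(range(len(w_f)), weights=w_f, k=n_f)
        buf_b = rng.choices(range(len(w_b)), weights=w_b, k=n_b)
        total += sum(loss[i][j] for i in buf_f for j in buf_b) / (n_f * n_b)
    return total / trials
```

With a fixed random loss table, the simulated average approaches the closed-form value as the number of trials grows, independently of the chosen buffer sizes.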



X. Performance with and without metric learning 


To assess the effect of different metric learning mechanisms, we conduct experiments on five video sequences.
Fig. 19 and Tab. VII report the corresponding results of the different metric learning mechanisms in CLE,
VOR, and success rate. From Fig. 19 and Tab. VII, we can see that tracking with metric learning generally performs
better than tracking without metric learning. In addition, the performance of metric learning with no eigendecomposition
is close to that of metric learning with step-by-step eigendecomposition, and better than that of metric learning with final
eigendecomposition. These results are consistent with those in [22]. Besides, metric learning with
step-by-step eigendecomposition is much slower than metric learning with no eigendecomposition, which is the variant
adopted by the proposed tracking algorithm.
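
For orientation, one common reading of the eigendecomposition step is a projection of the learned metric onto the cone of positive semidefinite matrices. This section does not spell out the operation, so the formula below is an assumption, and the symbols $\mathbf{M}$, $\mathbf{U}$, $\lambda_k$, and $d$ are illustrative:

```latex
\mathbf{M} = \mathbf{U}\,\mathrm{diag}(\lambda_1,\dots,\lambda_d)\,\mathbf{U}^{\top}
\quad\Longrightarrow\quad
\mathbf{M}_{+} = \mathbf{U}\,\mathrm{diag}\big(\max(\lambda_1,0),\dots,\max(\lambda_d,0)\big)\,\mathbf{U}^{\top}
```

Under this reading, the step-by-step variant would apply the projection after every online update, the final variant would apply it once after learning, and the variant with no eigendecomposition would skip it. Since a full eigendecomposition of a symmetric $d \times d$ matrix costs on the order of $d^3$ operations, repeating it at every update would account for the slowdown reported above.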
                                cubicle  football  iceball  trellis70  seq-jd
CLE
  ML w/o eigen                     4.31      3.14     3.03       5.61    4.30
  ML with final eigen              5.95      5.69     5.10       8.49    7.07
  ML with step-by-step eigen       2.16      1.89     1.30       4.54    3.31
  No metric learning               5.45     51.73     4.31       8.85    6.29
VOR
  ML w/o eigen                     0.74      0.67     0.68       0.78    0.72
  ML with final eigen              0.67      0.59     0.63       0.70    0.63
  ML with step-by-step eigen       0.79      0.71     0.69       0.82    0.75
  No metric learning               0.66      0.27     0.64       0.68    0.63
Success Rate
  ML w/o eigen                     0.98      0.88     0.93       0.98    0.94
  ML with final eigen              0.94      0.74     0.90       0.94    0.82
  ML with step-by-step eigen       0.98      0.90     0.95       0.99    0.95
  No metric learning               0.86      0.36     0.88       0.91    0.82


TABLE VII: Quantitative evaluation of the proposed tracker with different metric learning configurations on five video sequences. The table reports 
their average tracking results in CLE, VOR, and success rate. 


[Fig. 19 plots; recoverable panel titles: cubicle, football, iceball, trellis70; legend: ML w/o eigen, ML with final eigen, ML with step-by-step eigen]

Fig. 19: Quantitative evaluation of the proposed tracker with/without metric learning on five video sequences. The top two rows are associated with
the tracking performance in CLE, while the bottom two rows correspond to the tracking performance in VOR.


XI. Comparison of different linear representations 

We evaluate the performance of four types of linear representations: our linear representation with metric learning,
our linear representation without metric learning, the compressive sensing linear representation [7], and the
$\ell_1$-regularized linear representation [6]. For a fair comparison, we use the same raw pixel features as [7], [6]. Fig. 20
shows the performance of these four linear representation methods in CLE and VOR on four video sequences. Clearly,
our linear representation with metric learning achieves lower CLE and higher VOR in most frames
than the other three linear representations.
Fig. 20: Quantitative comparison of different linear representation methods in CLE and VOR on four video sequences.


XII. Evaluation of different sampling methods 

We examine the performance of two sampling methods: weighted reservoir sampling and ordinary reservoir sampling.
Fig. 21 shows the experimental results of the two sampling methods in CLE and VOR on four video sequences.
From Fig. 21, we can see that weighted reservoir sampling performs better than ordinary reservoir sampling.
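
To make the idea of a bounded, recency-biased sample buffer concrete, here is a minimal sketch. It uses the standard key-based weighted reservoir scheme of Efraimidis and Spirakis (sampling without replacement) as a stand-in, so it is not necessarily the variant used by the tracker or analysed in the appendix. The class name `TimeWeightedReservoir`, the weight form `q ** frame_index`, and the default value of `q` are assumptions made for illustration.

```python
import heapq
import math
import random


class TimeWeightedReservoir:
    """Fixed-capacity buffer that favours samples from recent frames."""

    def __init__(self, capacity, q=1.1, seed=None):
        self.capacity = capacity
        self.q = q                      # assumed time-weight base, q > 1
        self._rng = random.Random(seed)
        self._heap = []                 # min-heap of (key, counter, frame_index, sample)
        self._count = 0                 # tie-breaker so samples are never compared

    def add(self, sample, frame_index):
        weight = self.q ** frame_index
        u = 1.0 - self._rng.random()    # uniform on (0, 1], avoids log(0)
        key = math.log(u) / weight      # same ordering as u ** (1 / weight)
        entry = (key, self._count, frame_index, sample)
        self._count += 1
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, entry)      # buffer not full: keep everything
        elif key > self._heap[0][0]:
            heapq.heapreplace(self._heap, entry)   # evict the smallest key

    def samples(self):
        return [(frame_index, sample) for _, _, frame_index, sample in self._heap]
```

Until the capacity is reached every sample is kept, so any two sampling strategies of this kind behave identically during the first frames. Setting all weights to 1 recovers ordinary (uniform) reservoir sampling. For very long sequences `q ** frame_index` can overflow a float, so a practical version would rescale the weights.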
Fig. 21: Quantitative comparison of different sampling methods in CLE and VOR on four video sequences. The top two rows show the tracking performance
in CLE; the bottom two rows display the tracking performance in VOR. Before the buffer size limit is exceeded (which occurs approximately between
frame 40 and frame 50), the performances of the different sampling methods are identical.


XIII. Pedestrian identification 

Based on Equ. (19), we perform the pedestrian identification task from two viewpoints. Fig. 22 shows the quantitative
frame-by-frame identification results for five pedestrians from different viewpoints. Moreover, Figs. 23-31 display the
tracking and identification results for six pedestrians on several representative frames. Our method
achieves robust pedestrian identification performance across these cases.
Fig. 22: Intuitive illustration of two-view pedestrian identification. (a)-(e) show the quantitative frame-by-frame identification results of different
pedestrians from two viewpoints. It is clear that our method is able to assign the pedestrians to correct classes.

Fig. 23: Identification and tracking results for the first pedestrian on several representative frames (from View 1).



Fig. 24: Identification and tracking results for the first pedestrian on several representative frames (from View 2). 



Fig. 25: Identification and tracking results for the second pedestrian on several representative frames (from View 1). 
Fig. 26: Identification and tracking results for the second pedestrian on several representative frames (from View 2).



Fig. 27: Identification and tracking results for the third pedestrian on several representative frames (from View 1). 



Fig. 28: Identification and tracking results for the third pedestrian on several representative frames (from View 2). 
Fig. 29: Identification and tracking results for the fourth pedestrian on several representative frames (from View 1).



Fig. 30: Identification and tracking results for the fifth pedestrian on several representative frames (from View 1). 



Fig. 31: Identification and tracking results for the sixth pedestrian on several representative frames (from View 1). 

XIV. Comparison with the state-of-the-art trackers

We report the quantitative tracking results of the eleven trackers in CLE and VOR over the eighteen video sequences. 
Figs. 32 and 33 plot the frame-by-frame CLEs and VORs (marked with the curves in different colors) obtained by 
the eleven trackers. From Figs. 32 and 33, we observe that the proposed tracking algorithm achieves the best tracking 
performance on most video sequences. 
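
For readers who want to reproduce curves of this kind, the sketch below shows how the three evaluation measures are commonly computed. It assumes the usual definitions (CLE as the Euclidean distance between box centers, VOR as intersection over union, and success rate as the fraction of frames whose VOR exceeds a threshold). The `(x, y, width, height)` box format, the function names, and the 0.5 threshold are assumptions and are not taken from this section.

```python
import math


def center_location_error(box_a, box_b):
    """Euclidean distance between the centers of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    return math.hypot((ax + aw / 2) - (bx + bw / 2), (ay + ah / 2) - (by + bh / 2))


def voc_overlap_ratio(box_a, box_b):
    """Intersection area divided by union area of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def success_rate(tracked, truth, threshold=0.5):
    """Fraction of frames whose overlap ratio exceeds the (assumed) threshold."""
    hits = sum(voc_overlap_ratio(t, g) > threshold for t, g in zip(tracked, truth))
    return hits / len(truth)
```

A lower CLE and a higher VOR both indicate a tracked box that is closer to the ground truth, which is why the two measures are reported side by side.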

Moreover, we show the corresponding qualitative tracking results of the eleven trackers (highlighted by the bounding
boxes in different colors) over the representative frames of the eighteen video sequences in Figs. 34-51. As seen
from Figs. 34-51, the proposed tracking algorithm obtains the most accurate tracking results in most cases.

Fig. 32: Quantitative comparison of different trackers in CLE on the eighteen video sequences.

Fig. 33: Quantitative comparison of different trackers in VOR on all the eighteen video sequences.

Fig. 34: Tracking results of different trackers over some representative frames from the “BalanceBeam” video sequence in the scenarios with drastic
body pose variations and background clutters.



Fig. 35: Tracking results of different trackers over some representative frames from the “Lola” video sequence in the scenarios with drastic scale
changes and body pose variations.


Fig. 36: Tracking results of different trackers over some representative frames from the “trace” video sequence in the scenarios with drastic body 
pose variations and shape deformations. 



Fig. 37: Tracking results of different trackers over some representative frames from the “Walk” video sequence in the scenarios with drastic camera 
motion, partial occlusions, and background clutters. 

Fig. 38: Tracking results of different trackers over some representative frames from the “football” video sequence in the scenarios with small-sized
targets and partial occlusions.



Fig. 39: Tracking results of different trackers over some representative frames from the “iceball” video sequence in the scenarios with partial 
occlusions, out-of-plane rotations, body pose variations, and abrupt motion. 

Fig. 40: Tracking results of different trackers over some representative frames from the “coke11” video sequence in the scenarios with illumination
changes, severe occlusions, out-of-plane rotations, and background clutters.



Fig. 41: Tracking results of different trackers over some representative frames from the “trellis70” video sequence in the scenarios with drastic 
illumination changes and head pose variations. 

Fig. 42: Tracking results of different trackers over some representative frames from the “dograce” video sequence in the scenarios with drastic pose
changes and shape deformations.

Fig. 43: Tracking results of different trackers over some representative frames from the “football3” video sequence in the scenarios with motion
blurring, partial occlusions, head pose variations, and background clutters.

Fig. 44: Tracking results of different trackers over some representative frames from the “cubicle” video sequence in the scenarios with severe
occlusions, out-of-plane rotations, and head pose changes.

Fig. 45: Tracking results of different trackers over some representative frames from the “seq-jd” video sequence in the scenarios with severe
occlusions, out-of-plane rotations, and head pose changes.

Fig. 46: Tracking results of different trackers over some representative frames from the “girl” video sequence in the scenarios with severe occlusions,
out-of-plane & in-plane rotations, and head pose changes.

Fig. 47: Tracking results of different trackers over some representative frames from the “planeshow” video sequence in the scenarios with shape
deformations, out-of-plane rotations, and pose variations.



Fig. 48: Tracking results of different trackers over some representative frames from the “BMX-Street” video sequence in the scenarios with shape 
deformations, partial occlusions, and body pose changes. 



Fig. 49: Tracking results of different trackers over some representative frames from the “race” video sequence in the scenarios with background 
clutters. 

Fig. 50: Tracking results of different trackers over some representative frames from the “CamSeq01” video sequence in the scenarios with body
pose changes.



Fig. 51: Tracking results of different trackers over some representative frames from the “car11” video sequence in the scenarios with varying
lighting conditions and background clutters.

XV. Discussion

Based on the obtained experimental results, we observe that the proposed tracking algorithm has the following
properties. First, once the buffer size exceeds a certain value (around 300 in our experiments), the tracking performance
remains stable as the buffer size increases, as shown in Fig. 3. This is desirable since a large buffer is not needed
to achieve promising performance. Second, in contrast to many existing particle filtering-based trackers whose running
time is typically linear in the number of particles, our method’s running time is sublinear in the number of particles, as
shown in Fig. 3. Moreover, its tracking performance improves rapidly and finally converges to a certain value, as shown
in Fig. 3. Third, as shown in Fig. 19 and Tab. VII, the performance of our metric learning with no eigendecomposition
is close to that of the computationally expensive metric learning with step-by-step eigendecomposition. Fourth,
structured metric learning improves the tracking performance in CLE and VOR, as shown in Tab. V. This
is because structured metric learning encodes the underlying structural interaction information among data samples,
which plays an important role in robust visual tracking. Fifth, the linear representation with metric learning
yields better tracking accuracy than the other linear representations, as shown in Fig. 20. Sixth, weighted reservoir
sampling effectively maintains and updates the foreground and background sample buffers for metric learning, as shown
in Fig. 21. Seventh, compared with other state-of-the-art trackers, the proposed tracker adapts effectively to complicated
appearance changes during tracking by constructing a metric-weighted linear representation with weighted reservoir sampling,
as shown in Fig. 32 and Tab. IV.



